arxiv compressed, 2024-03-22

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-22 generated by the compressor, my personal LLM-based project.


On Pretraining Data Diversity for Self-Supervised Learning

http://arxiv.org/abs/2403.13808v1

Compressor summary: Increasing data diversity in self-supervised learning improves performance only when the distribution distance to downstream data is small.


Editing Massive Concepts in Text-to-Image Diffusion Models

http://arxiv.org/abs/2403.13807v1

Compressor summary: EMCID is a two-stage method that edits massive concepts in text-to-image diffusion models, addressing risks such as outdated content and copyright infringement, while offering scalability for real-world applications.


RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

http://arxiv.org/abs/2403.13805v1

Compressor summary: CLIP and MLLMs have complementary strengths for vision-language recognition tasks; RAR combines them to improve accuracy on fine-grained, few-shot, and zero-shot recognition.


RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS

http://arxiv.org/abs/2403.13806v1

Compressor summary: RadSplat is a lightweight real-time rendering method that leverages radiance fields for scene representation and optimization, point pruning for compactness, and test-time filtering for speed, achieving state-of-the-art results on complex scenes.


Learning from Models and Data for Visual Grounding

http://arxiv.org/abs/2403.13804v1

Compressor summary: SynGround is a framework that enhances vision-and-language models by combining data-driven learning, knowledge transfer, and mask-attention consistency to improve grounding capabilities and performance on pointing games.


Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments

http://arxiv.org/abs/2403.13803v1

Compressor summary: The authors propose a box stability score (BoS) that reflects the stability of bounding boxes in object detection, which correlates with detector accuracy and can be used to assess detectors without test ground truths.
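The idea of scoring a detector by how much its boxes move under feature dropout can be illustrated with a toy sketch (our illustration, not the authors' code; box matching is assumed to be already done, whereas the paper handles it more carefully):

```python
# Toy sketch of a box-stability-style score: run the detector with and
# without feature dropout and measure how far each predicted box moves,
# via IoU. Boxes are assumed already matched one-to-one by index.
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def box_stability_score(boxes, boxes_dropout):
    """Mean IoU between boxes predicted with and without feature dropout."""
    if not boxes:
        return 0.0
    return sum(iou(a, b) for a, b in zip(boxes, boxes_dropout)) / len(boxes)
```

A score near 1 means the boxes barely move under dropout, which the paper reports correlates with detector accuracy on unseen environments.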


ZigMa: Zigzag Mamba Diffusion Model

http://arxiv.org/abs/2403.13802v1

Compressor summary: The study proposes Zigzag Mamba, a method that improves speed and memory utilization for visual data generation by addressing spatial continuity issues in the State-Space Model Mamba.


TimeRewind: Rewinding Time with Image-and-Events Video Diffusion

http://arxiv.org/abs/2403.13800v1

Compressor summary: The paper presents a method to generate videos by rewinding time from a single image using neuromorphic event cameras and diffusion models, showing promising results for capturing missed moments in computer vision and photography.


Reverse Training to Nurse the Reversal Curse

http://arxiv.org/abs/2403.13799v1

Compressor summary: The paper proposes reverse training for large language models to improve their ability to handle reverse relations, such as "B is a feature of A", by doubling the available tokens and training in both forward and reverse directions.
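The token-doubling idea can be sketched in a few lines (a minimal toy illustration, not the paper's code; the paper also considers more careful reversal variants such as preserving entity spans):

```python
# Toy sketch of reverse training: alongside each forward example, add a
# copy with the token order reversed, doubling the available training
# tokens and exposing the model to both directions of each relation.
def make_reverse_training_data(sequences):
    """Return forward plus token-reversed copies of each token sequence."""
    augmented = []
    for tokens in sequences:
        augmented.append(tokens)        # forward direction
        augmented.append(tokens[::-1])  # reversed direction
    return augmented

# Example: a fact stated as "A ... B" also appears as "B ... A".
data = make_reverse_training_data([["A", "is", "a", "feature", "of", "B"]])
```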


Hierarchical NeuroSymbolic Approach for Action Quality Assessment

http://arxiv.org/abs/2403.13798v1

Compressor summary: The text introduces a neuro-symbolic approach for action quality assessment using computer vision that is more transparent, unbiased, and informative than existing neural models, and applies it to diving.


Bridge the Modality and Capacity Gaps in Vision-Language Model Selection

http://arxiv.org/abs/2403.13797v1

Compressor summary: SWAB is a method that uses optimal transport to transfer statistics between open-source and target datasets, improving zero-shot image classification by selecting the best Pre-Trained VLM from the VLM Zoo based on text data only.


Evaluating Frontier Models for Dangerous Capabilities

http://arxiv.org/abs/2403.13793v1

Compressor summary: The authors evaluate Gemini 1.0 AI models on four potential dangerous capabilities and find no strong evidence of risk, but highlight early warning signs.


DepthFM: Fast Monocular Depth Estimation with Flow Matching

http://arxiv.org/abs/2403.13788v1

Compressor summary: The paper proposes a generative method for monocular depth estimation that uses image diffusion as a prior and flow matching to efficiently map from input images to depth maps, achieving state-of-the-art results with low computational cost.


RewardBench: Evaluating Reward Models for Language Modeling

http://arxiv.org/abs/2403.13787v1

Compressor summary: RewardBench is a benchmark dataset and codebase for evaluating reward models used in language model alignment, revealing their strengths and weaknesses on chat, reasoning, and safety tasks.


Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts

http://arxiv.org/abs/2403.13786v1

Compressor summary: The Chain-of-Interaction (CoI) prompting method helps large language models understand patient behaviors during motivational interviewing by considering the coding scheme, therapist question strategies, and dyadic interactions between patients and therapists.


Towards an extension of Fault Trees in the Predictive Maintenance Scenario

http://arxiv.org/abs/2403.13785v1

Compressor summary: The paper presents an extension of Fault Trees to model and analyze Predictive Maintenance problems in modern systems.


The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency and Usability in AI

http://arxiv.org/abs/2403.13784v1

Compressor summary: The Model Openness Framework (MOF) is a system to rate and promote openness in generative AI models, addressing concerns about transparency, reproducibility, bias, and safety.


Sparse Implementation of Versatile Graph-Informed Layers

http://arxiv.org/abs/2403.13781v1

Compressor summary: This paper proposes a sparse implementation of Graph-Informed layers for Graph Neural Networks that reduces memory usage and improves computational efficiency, enabling deeper and more scalable models on large graphs.


Information-Theoretic Distillation for Reference-less Summarization

http://arxiv.org/abs/2403.13780v1

Compressor summary: InfoSumm is a novel framework that distills a powerful summarizer using an information-theoretic objective without relying on large-scale language models or human references.


Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models

http://arxiv.org/abs/2403.13771v1

Compressor summary: The paper introduces Describe-and-Dissect (DnD), a method that uses multimodal deep learning to generate natural language descriptions of hidden neurons in vision networks without training data or predefined concepts, and shows its superiority over prior work.


Towards Principled Representation Learning from Videos for Reinforcement Learning

http://arxiv.org/abs/2403.13765v1

Compressor summary: The paper investigates the theoretical and empirical aspects of learning latent state representations from video data for decision-making tasks, finding that temporal contrastive learning and forward modeling perform well in settings with only iid noise but struggle with exogenous noise.


Practical End-to-End Optical Music Recognition for Pianoform Music

http://arxiv.org/abs/2403.13763v1

Compressor summary: The paper proposes a new format for optical music recognition (OMR) models, called Linearized MusicXML, which allows them to read input images and produce a linear sequence of tokens compatible with industry standards, and creates a benchmark dataset based on the OpenScore Lieder corpus.


HierCode: A Lightweight Hierarchical Codebook for Zero-shot Chinese Text Recognition

http://arxiv.org/abs/2403.13761v1

Compressor summary: HierCode is a lightweight codebook that uses a multi-hot encoding strategy to represent Chinese characters hierarchically and facilitate zero-shot recognition of out-of-vocabulary characters, achieving state-of-the-art performance with fast inference speed.


Enhancing Gait Video Analysis in Neurodegenerative Diseases by Knowledge Augmentation in Vision Language Model

http://arxiv.org/abs/2403.13756v1

Compressor summary: The authors propose a model that uses a pre-trained vision language model to improve the understanding of patient gait videos by learning from text, video, and numerical data, achieving better performance than previous methods.


Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

http://arxiv.org/abs/2403.13754v1

Compressor summary: The study examines how different tokenization schemes affect number agreement for Spanish plural nouns, finding that a morphology-based scheme performs comparably to the alternatives and is not necessary for good performance.


Weisfeiler and Leman Go Loopy: A New Hierarchy for Graph Representational Learning

http://arxiv.org/abs/2403.13749v1

Compressor summary: The paper introduces a new graph isomorphism test hierarchy, $r$-loopy Weisfeiler-Leman (r-WL), that can count homomorphisms of cactus graphs and a corresponding GNN framework, r-MPNN, which performs well on various datasets.


Leveraging High-Resolution Features for Improved Deep Hashing-based Image Retrieval

http://arxiv.org/abs/2403.13747v1

Compressor summary: The authors propose HHNet, a method that uses High-Resolution Networks (HRNets) to learn high-resolution features for efficient image retrieval, outperforming existing methods on various datasets.


Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

http://arxiv.org/abs/2403.13745v1

Compressor summary: MOTIA is a diffusion-based method that adapts to input videos and uses learned patterns for effective video outpainting, achieving superior results with minimal tuning.


Uncertainty-Aware Explanations Through Probabilistic Self-Explainable Neural Networks

http://arxiv.org/abs/2403.13740v1

Compressor summary: Prob-PSENN is a probabilistic reformulation of Deep Neural Networks that offers transparent, flexible, and uncertain prediction explanations using probability distributions over prototypes.


EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation

http://arxiv.org/abs/2403.13737v1

Compressor summary: The paper introduces EthioLLM, multilingual large language models for five Ethiopian languages and English, and Ethiobenchmark, a new benchmark dataset for various NLP tasks, to improve the state of low-resource language NLP in Ethiopia.


M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling

http://arxiv.org/abs/2403.13728v1

Compressor summary: The authors propose a probabilistic graphical model to optimize neural network parameters and weight multipliers jointly, using a hypervolume based likelihood that promotes descent of each loss term, resulting in a multiplier-free method that saves computational resources and outperforms other methods.


Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes

http://arxiv.org/abs/2403.13724v1

Compressor summary: The paper presents a framework for probabilistic forecasting of dynamical systems in which stochastic interpolants construct a generative model between base and target distributions, a fictitious non-physical stochastic dynamics produces samples from the target conditional distribution, and the drift coefficient is learned efficiently by square-loss regression, handling complex, high-dimensional problems such as fluid dynamics and video prediction.


Research Re: search & Re-search

http://arxiv.org/abs/2403.13705v1

Compressor summary: The text discusses two types of search algorithms (depth-first and best-first) for minimax games and challenges the prevailing opinion that best-first algorithms are more efficient but less practical than depth-first algorithms.


Fostc3net:A Lightweight YOLOv5 Based On the Network Structure Optimization

http://arxiv.org/abs/2403.13703v1

Compressor summary: The paper proposes a lightweight YOLOv5 technique for detecting transmission line objects on mobile devices with improved efficiency and accuracy by integrating C3Ghost and FasterNet modules and using wIoUv3 loss function.


SPTNet: An Efficient Alternative Framework for Generalized Category Discovery with Spatial Prompt Tuning

http://arxiv.org/abs/2403.13684v1

Compressor summary: SPTNet adapts both model and data representation for generalized category discovery, using a novel spatial prompt tuning method that focuses on object parts and achieves state-of-the-art performance with minimal additional parameters.


DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses

http://arxiv.org/abs/2403.13683v1

Compressor summary: The Deep Voxel Matching Network (DVMNet) is a method that computes the relative pose of an object between two images in 3D space using voxels and a least-squares problem, achieving more accurate results and lower computational cost than existing methods.


PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents

http://arxiv.org/abs/2403.13681v1

Compressor summary: The paper introduces PARAMANU-AYN, a legal language model based on Indian laws and constitution that can perform various legal tasks without much data or fine-tuning.


RoleInteract: Evaluating the Social Interaction of Role-Playing Agents

http://arxiv.org/abs/2403.13679v1

Compressor summary: RoleInteract is a new benchmark to evaluate social intelligence in AI conversational agents that mimic diverse characters and human behaviors.


AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

http://arxiv.org/abs/2403.13678v1

Compressor summary: The paper proposes a novel audio-visual approach to improve facial action unit detection by using advanced features, adaptive fusion, and context-aware modeling.


Retina Vision Transformer (RetinaViT): Introducing Scaled Patches into Vision Transformers

http://arxiv.org/abs/2403.13677v1

Compressor summary: The proposed Retina Vision Transformer (RetinaViT) model incorporates low spatial frequency components in the input to improve visual scene formation and achieve better performance than the original ViT on ImageNet-1K dataset.


Machine Learning Optimized Approach for Parameter Selection in MESHFREE Simulations

http://arxiv.org/abs/2403.13672v1

Compressor summary: The paper presents an ML-optimized approach to optimize parameters in meshfree simulation software, improving its usability and performance for various applications.


DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

http://arxiv.org/abs/2403.13667v1

Compressor summary: The paper introduces DCM, a new dataset that combines camera movement, dance, and music, and proposes DanceCamera3D, a transformer-based model for synthesizing dance camera movements.


Grounding Spatial Relations in Text-Only Language Models

http://arxiv.org/abs/2403.13666v1

Compressor summary: Text-only Language Models can learn spatial relations with locations, outperforming Vision-and-Language Models on a verbalized VSR dataset.


T-Pixel2Mesh: Combining Global and Local Transformer for 3D Mesh Generation from a Single Image

http://arxiv.org/abs/2403.13663v1

Compressor summary: T-Pixel2Mesh is a new method to reconstruct 3D shapes from images using Transformers that improves details and works better with real-world data.


ProMamba: Prompt-Mamba for polyp segmentation

http://arxiv.org/abs/2403.13660v1

Compressor summary: The text introduces a new segmentation model for polyp detection in medical images that uses Prompt-Mamba, which combines Vision-Mamba and prompt technologies to achieve high accuracy and generalization across different datasets.


Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection

http://arxiv.org/abs/2403.13658v1

Compressor summary: The paper proposes a novel multimodal variational autoencoder (CardioVAE) that integrates chest X-ray and electrocardiogram data to predict cardiac hemodynamic instability, improving performance and interpretability over single-modality methods.


Learning User Embeddings from Human Gaze for Personalised Saliency Prediction

http://arxiv.org/abs/2403.13653v1

Compressor summary: The paper proposes a method to learn user embeddings for personalized saliency prediction using natural images and eye tracking data, improving performance and generalization.


ZoDi: Zero-Shot Domain Adaptation with Diffusion-Based Image Transfer

http://arxiv.org/abs/2403.13652v1

Compressor summary: ZoDi is a zero-shot domain adaptation method that uses diffusion models to synthesize target-like images and train a model with both source and synthesized images, improving image segmentation performance without target images.


Meta-Point Learning and Refining for Category-Agnostic Pose Estimation

http://arxiv.org/abs/2403.13647v1

Compressor summary: The proposed method predicts potential keypoints for arbitrary objects using learnable embeddings, and refines them with support keypoints and a slacked regression loss.


H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation

http://arxiv.org/abs/2403.13642v1

Compressor summary: The paper proposes a new model, H-vmunet, for medical image segmentation that improves the 2D-selective-scan (SS2D) module by adding higher-order interactions and a Local-SS2D module to enhance local feature learning.


Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

http://arxiv.org/abs/2403.13638v1

Compressor summary: The paper explores using machine translation to create synthetic data for pre-training language models in low-resource languages, showing improved performance on NLU and NLG tasks with efficient filtering and extended pretraining.


Enhancing Law Enforcement Training: A Gamified Approach to Detecting Terrorism Financing

http://arxiv.org/abs/2403.13625v1

Compressor summary: The study describes how gamified learning and training methods can help law enforcement and other professionals better understand and combat cyber-crime and terrorism financing.


Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

http://arxiv.org/abs/2403.13612v1

Compressor summary: The study evaluates the effectiveness of various methods for generating private synthetic biomedical data, focusing on the Mann-Whitney U test's ability to maintain validity and power when applied to such data.


VL-Mamba: Exploring State Space Models for Multimodal Learning

http://arxiv.org/abs/2403.13600v1

Compressor summary: The paper introduces VL-Mamba, a more efficient multimodal language model based on state space models that can handle long sequences with fast inference and linear scaling.


Llama meets EU: Investigating the European Political Spectrum through the Lens of LLMs

http://arxiv.org/abs/2403.13592v1

Compressor summary: The authors study the political leanings and knowledge of Llama Chat, a large language model, when fine-tuned on EU politics and suggest it could be used for research purposes.


Teacher-Student Training for Debiasing: General Permutation Debiasing for Large Language Models

http://arxiv.org/abs/2403.13590v1

Compressor summary: The paper proposes efficient methods to transfer knowledge from debiased large language models to smaller, more reliable ones, improving performance while reducing parameters and computational cost.


ReGround: Improving Textual and Spatial Grounding at No Cost

http://arxiv.org/abs/2403.13589v1

Compressor summary: An image diffusion model that integrates gated self-attention into its U-Net shows a bias towards spatial cues over textual cues, which can be corrected by rearranging the network architecture without any fine-tuning.


Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation

http://arxiv.org/abs/2403.13578v1

Compressor summary: The paper proposes two new bandit methods to improve natural language generation by jointly optimizing multiple text qualities in counselor reflection generation.


Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

http://arxiv.org/abs/2403.13570v1

Compressor summary: The paper presents a new learning approach for creating realistic 4D head avatars using pseudo multi-view videos and a vision transformer backbone, outperforming previous methods in various aspects.


eRST: A Signaled Graph Theory of Discourse Relations and Organization

http://arxiv.org/abs/2403.13560v1

Compressor summary: The article introduces eRST, a new discourse analysis framework that improves on RST by including more relation types, implicit and explicit signals, and provides tools and a large corpus to support it.


Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

http://arxiv.org/abs/2403.13556v1

Compressor summary: The paper proposes a method to improve LiDAR-based 3D object detection in urban environments by using open-vocabulary learning and multi-sensor data with pre-trained vision-language models, achieving better novel object recall and reducing bias towards camera-proximal objects.


Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

http://arxiv.org/abs/2403.13551v1

Compressor summary: Ground-A-Score is a model-agnostic image editing method that incorporates grounding during score distillation to accurately reflect complex text prompts and preserve object integrity in edited images.


Diversity-aware Channel Pruning for StyleGAN Compression

http://arxiv.org/abs/2403.13548v1

Compressor summary: Our method prunes channels based on their sensitivities to latent vector perturbations, improving sample diversity and FID scores in compressed StyleGAN models without extra training cost.
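The sensitivity criterion can be illustrated with a small NumPy sketch (our toy illustration under assumed names, not the authors' code): score each channel by how much its activation changes when the latent vector is perturbed, then keep only the most sensitive channels.

```python
import numpy as np

# Toy sketch of diversity-aware channel scoring: channels whose outputs
# barely react to latent perturbations contribute little to sample
# diversity and are candidates for pruning.
def channel_sensitivity(forward, z, n_probes=8, eps=0.1, seed=0):
    """forward(z) -> per-channel activations, shape (channels,)."""
    rng = np.random.default_rng(seed)
    base = forward(z)
    diffs = []
    for _ in range(n_probes):
        dz = eps * rng.standard_normal(z.shape)
        diffs.append(np.abs(forward(z + dz) - base))
    return np.mean(diffs, axis=0)  # per-channel sensitivity score

def prune_mask(scores, keep_ratio=0.5):
    """Boolean mask keeping the most latent-sensitive channels."""
    k = int(len(scores) * keep_ratio)
    keep = np.argsort(scores)[-k:]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask
```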


Integrating Large Language Models for Severity Classification in Traffic Incident Management: A Machine Learning Approach

http://arxiv.org/abs/2403.13547v1

Compressor summary: The study shows that using features from large language models can improve or match the accuracy of predicting traffic incident severity using conventional machine learning algorithms.


Next day fire prediction via semantic segmentation

http://arxiv.org/abs/2403.13545v1

Compressor summary: The paper proposes a deep learning method for predicting fire risk in an area based on images representing daily snapshots and various features.


What explains the success of cross-modal fine-tuning with ORCA?

http://arxiv.org/abs/2403.13537v1

Compressor summary: ORCA is a cross-modal fine-tuning technique that performs well on 1D tasks but not 2D tasks, and model fine-tuning is more important than embedder training for its success.


IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

http://arxiv.org/abs/2403.13535v1

Compressor summary: IDAdapter is a new method that creates personalized and diverse avatars from a single face image, using textual and visual inputs and preserving identity details.


Compress3D: a Compressed Latent Space for 3D Generation from a Single Image

http://arxiv.org/abs/2403.13524v1

Compressor summary: The paper introduces a triplane autoencoder with a 3D-aware cross-attention mechanism and a diffusion model for efficient compression and high-speed generation of 3D models from images, achieving superior performance compared to existing methods.


Have You Poisoned My Data? Defending Neural Networks against Data Poisoning

http://arxiv.org/abs/2403.13523v1

Compressor summary: Noting that large amounts of training data invite threats like poisoning attacks, the paper proposes a characteristic-vector representation that captures intrinsic properties of the data distribution and uses it to detect and filter clean-label poisoned datapoints in transfer-learning settings, outperforming existing defenses in experiments.


REAL: Representation Enhanced Analytic Learning for Exemplar-free Class-incremental Learning

http://arxiv.org/abs/2403.13522v1

Compressor summary: REAL is a novel method for exemplar-free class-incremental learning that enhances representation by combining supervised and self-supervised learning, distillation, and analytic learning.


Motion Generation from Fine-grained Textual Descriptions

http://arxiv.org/abs/2403.13518v1

Compressor summary: Text2motion aims to create motion sequences from fine-grained textual descriptions, using FineHumanML3D dataset and FineMotionDiffuse model trained with GPT-3.5-turbo.


How Gender Interacts with Political Values: A Case Study on Czech BERT Models

http://arxiv.org/abs/2403.13514v1

Compressor summary: The study examines political biases in Czech neural language models and finds no systematic alignment with values, but rather superficial imitation of training data patterns.


What if...?: Counterfactual Inception to Mitigate Hallucination Effects in Large Multimodal Models

http://arxiv.org/abs/2403.13513v1

Compressor summary: The paper introduces Counterfactual Inception, a method to reduce hallucination in large multimodal models by injecting counterfactual thoughts using misaligned keywords, and Dual-modality Verification Process for selecting optimal keywords.


Scale Decoupled Distillation

http://arxiv.org/abs/2403.13512v1

Compressor summary: SDD improves logit knowledge distillation by separating global logits into local ones and transferring unambiguous, fine-grained knowledge to the student.


FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

http://arxiv.org/abs/2403.13507v1

Compressor summary: The paper introduces a new adversarial attack on video-based language models that can trick them into generating wrong or nonsensical answers by adding subtle perturbations to videos.


VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

http://arxiv.org/abs/2403.13501v1

Compressor summary: The paper introduces VSTAR, a method that improves the generation of longer, more dynamic videos from text using generative temporal nursing with video synopsis prompting and temporal attention regularization.


Improved Baselines for Data-efficient Perceptual Augmentation of LLMs

http://arxiv.org/abs/2403.13499v1

Compressor summary: The text describes how large language models can be applied to various computer vision tasks by connecting them to perceptual backbones, and presents an experimental evaluation identifying interfacing mechanisms that improve performance and reduce training time.


An Entropy-based Text Watermarking Detection Method

http://arxiv.org/abs/2403.13485v1

Compressor summary: The paper proposes an Entropy-based Watermark Detection (EWD) for large language models that considers token entropy during detection and improves performance in low-entropy scenarios.
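The entropy-weighting idea can be sketched as follows (a toy illustration of the general principle, with hypothetical names, not the paper's detector): in green-list watermark detection, weighting each token's contribution by its entropy downplays low-entropy tokens whose choice the watermark could not have influenced.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy entropy-weighted detection statistic: the fraction of green-list
# tokens, where each token counts in proportion to its entropy, so
# forced (low-entropy) tokens barely affect the score.
def weighted_green_score(tokens, green_list, token_probs):
    weights = [entropy(token_probs[t]) for t in tokens]
    total = sum(weights)
    if total == 0:
        return 0.0
    hits = sum(w for t, w in zip(tokens, weights) if t in green_list)
    return hits / total
```

Here a near-deterministic token (entropy close to 0) contributes almost nothing, so watermarked low-entropy text is no longer unfairly scored as unwatermarked.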


A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels

http://arxiv.org/abs/2403.13480v1

Compressor summary: The paper proposes a new framework for cross-modal retrieval that uses optimal transport to align semantics, correct noisy labels, and narrow the heterogeneous gap between modalities.


Deepfake Detection without Deepfakes: Generalization via Synthetic Frequency Patterns Injection

http://arxiv.org/abs/2403.13479v1

Compressor summary: The paper introduces a new method for training deepfake detectors that can recognize various techniques by injecting crafted frequency patterns into pristine images, improving their generalization capabilities.


Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion

http://arxiv.org/abs/2403.13470v1

Compressor summary: The paper proposes a diffusion model that completes 3D LiDAR scenes from a single scan by operating directly on the points rather than on range images extracted from LiDAR, with a regularization loss that stabilizes noise prediction during denoising, yielding more detailed and complete scenes than existing methods.


Progressive trajectory matching for medical dataset distillation

http://arxiv.org/abs/2403.13469v1

Compressor summary: The paper proposes a novel method to create a synthetic medical image dataset from the original one while preserving useful information, improving training stability, diversity, and performance using progressive trajectory matching and dynamic overlap mitigation.


An AI-Assisted Skincare Routine Recommendation System in XR

http://arxiv.org/abs/2403.13466v1

Compressor summary: The paper presents an AI-assisted skincare recommendation system integrated into an XR platform that uses a CNN to classify skin type from a facial image and questionnaire data, achieving high accuracy in identifying skin issues and delivering immersive, engaging personalised product recommendations.


HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

http://arxiv.org/abs/2403.13447v1

Compressor summary: HyperLLaVA improves multimodal task performance by dynamically adjusting visual and language experts using HyperNetworks, unlike static tuning in LLaVA.


MedCycle: Unpaired Medical Report Generation via Cycle-Consistency

http://arxiv.org/abs/2403.13444v1

Compressor summary: The paper proposes a novel method for generating medical reports from X-ray images without paired data, using cycle-consistent mapping functions and report auto-encoding, leading to improved results in chest X-ray report generation.


Fast-Poly: A Fast Polyhedral Framework For 3D Multi-Object Tracking

http://arxiv.org/abs/2403.13443v1

Compressor summary: Fast-Poly is a fast and effective 3D multi-object tracking method that addresses object rotational anisotropy, enhances local computation densification, and leverages parallelization for improved accuracy and speed on large-scale datasets.


Robustness Verifcation in Neural Networks

http://arxiv.org/abs/2403.13441v1

Compressor summary: The paper studies formal verification problems for neural networks using symbolic specifications and provides a theoretical framework to analyze their complexities in a semi-linear setting.


Advancing 6D Pose Estimation in Augmented Reality -- Overcoming Projection Ambiguity with Uncontrolled Imagery

http://arxiv.org/abs/2403.13434v1

Compressor summary: The study presents a new method for accurate 6D pose estimation in Augmented Reality from uncontrolled RGB images, improving 3D object overlaying and applications in manufacturing and robotics.


Agent Group Chat: An Interactive Group Chat Simulacra For Better Eliciting Collective Emergent Behavior

http://arxiv.org/abs/2403.13433v1

Compressor summary: The Agent Group Chat simulation explores how language influences human collective behavior by creating diverse narrative scenarios and measuring the disorder of agents' interactions.


MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining

http://arxiv.org/abs/2403.13430v1

Compressor summary: This study proposes Multi-Task Pretraining for foundation models in Remote Sensing, which improves downstream tasks by addressing task discrepancy and achieving competitive performance with fewer parameters.


Diversified and Personalized Multi-rater Medical Image Segmentation

http://arxiv.org/abs/2403.13417v1

Compressor summary: The D-Persona framework aims to achieve both diversified and personalized results in multi-rater medical image segmentation by training a Probabilistic U-Net model and using attention-based projection heads.


Cell Tracking in C. elegans with Cell Position Heatmap-Based Alignment and Pairwise Detection

http://arxiv.org/abs/2403.13412v1

Compressor summary: The paper proposes a cell tracking method for C. elegans that handles large migrations due to head movement, inconsistent detections, and low-contrast images by using non-rigid alignment and pairwise detection.


S2DM: Sector-Shaped Diffusion Models for Video Generation

http://arxiv.org/abs/2403.13408v1

Compressor summary: The paper proposes a new video generation method called Sector-Shaped Diffusion Model (S2DM) that maintains consistency and continuity across video frames by using sector-shaped diffusion regions and optical flow as temporal conditions.


DOR3D-Net: Dense Ordinal Regression Network for 3D Hand Pose Estimation

http://arxiv.org/abs/2403.13405v1

Compressor summary: The paper proposes a novel network for 3D hand pose estimation using ordinal regression, which improves accuracy by reducing noise and outliers in large-scale regression offset values.


Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments

http://arxiv.org/abs/2403.13395v1

Compressor summary: The paper proposes UMF, a multimodal model that improves place recognition for SLAM in aliased, low-texture environments by applying cross-attention between vision and LiDAR features and re-ranking candidates via local feature matching.


Robust image segmentation model based on binary level set

http://arxiv.org/abs/2403.13392v1

Compressor summary: The paper proposes a robust image segmentation model that uses intensity inhomogeneity and binary level set to handle noise and improve performance.


IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis

http://arxiv.org/abs/2403.13378v1

Compressor summary: The paper proposes a novel diffusion model for semantic image synthesis that uses segmentation masks and style reference images, and improves generation quality with refinement, color-transfer, and model-ensemble techniques.


Correlation Clustering of Organoid Images

http://arxiv.org/abs/2403.13376v1

Compressor summary: The paper presents models and algorithms for correlation clustering of microscopy images of heterogeneous organoids, enabling similar patterns to be found across images.


Few-shot Oriented Object Detection with Memorable Contrastive Learning in Remote Sensing Images

http://arxiv.org/abs/2403.13375v1

Compressor summary: FOMC is a novel FSOD method for remote sensing images that uses oriented bounding boxes and supervised contrastive learning to improve detection performance for arbitrary-oriented objects with limited annotations.


LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

http://arxiv.org/abs/2403.13372v1

Compressor summary: LlamaFactory is a framework that helps users fine-tune large language models efficiently and effectively using a web UI without coding.


Counting Network for Learning from Majority Label

http://arxiv.org/abs/2403.13370v1

Compressor summary: The paper introduces a new problem in multi-class learning, called Learning from the Majority Label, and proposes a counting network to solve it effectively.


Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting

http://arxiv.org/abs/2403.13369v1

Compressor summary: The authors evaluate domain-adapted and prompted lightweight models for medical information extraction from German doctor's letters, achieving high accuracy with minimal training data and ensuring interpretability of predictions using Shapley values.


Computational Models to Study Language Processing in the Human Brain: A Survey

http://arxiv.org/abs/2403.13368v1

Compressor summary: The paper discusses using computational language models in brain research, evaluates their performance, and emphasizes the importance of diverse data and strict experiments.


AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation

http://arxiv.org/abs/2403.13352v1

Compressor summary: AGFSync improves image generation by using VLM to assess and provide feedback to T2I diffusion models in a closed AI-driven loop, leading to better quality images and performance on benchmarks.


OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning

http://arxiv.org/abs/2403.13351v1

Compressor summary: Orthogonal Capsule Network (OrthCaps) reduces redundancy and improves routing performance in CapsNet using pruning, sparse attention routing, and orthogonal weight matrices, achieving competitive results with significantly fewer parameters.


Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection

http://arxiv.org/abs/2403.13349v1

Compressor summary: HGAD is a novel anomaly detection method that uses hierarchical Gaussian mixture modeling to overcome the limitations of existing normalizing flow-based approaches, achieving better performance on real-world datasets.


vid-TLDR: Training Free Token merging for Light-weight Video Transformer

http://arxiv.org/abs/2403.13347v1

Compressor summary: The authors propose vid-TLDR, a lightweight video Transformer that merges background tokens and focuses on salient regions using attention maps and token dropping, to improve efficiency without sacrificing performance.


TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation

http://arxiv.org/abs/2403.13343v1

Compressor summary: TiBiX is a method that uses temporal information to generate reports and images for chest X-rays, improving the quality of medical information.


FissionFusion: Fast Geometric Generation and Hierarchical Souping for Medical Image Analysis

http://arxiv.org/abs/2403.13341v1

Compressor summary: The paper proposes a hierarchical merging approach to improve medical imaging performance and robustness by aggregating models based on hyperparameter configurations, leading to better results than model soups on in-domain and out-of-distribution tasks.


Adaptive Critical Subgraph Mining for Cognitive Impairment Conversion Prediction with T1-MRI-based Brain Network

http://arxiv.org/abs/2403.13338v1

Compressor summary: The paper proposes Brain-SubGNN, a graph representation network that mines and enhances critical subgraphs based on T1-MRI to improve understanding and diagnosis of early-stage dementia.


Learning Novel View Synthesis from Heterogeneous Low-light Captures

http://arxiv.org/abs/2403.13337v1

Compressor summary: The paper proposes a method to decompose illumination, reflectance, and noise from input views under low-light conditions, enabling better synthesis of novel views with improved visual quality.


Adaptive Ensembles of Fine-Tuned Transformers for LLM-Generated Text Detection

http://arxiv.org/abs/2403.13335v1

Compressor summary: The study shows that combining multiple transformer-based models with adaptive ensemble algorithms can significantly improve the accuracy of detecting fake text generated by large language models on different types of data.


Hyacinth6B: A large language model for Traditional Chinese

http://arxiv.org/abs/2403.13334v1

Compressor summary: The study aims to create a lightweight Traditional Chinese language model, Hyacinth6B, that performs well without high computational costs by using the LoRA method for efficient fine-tuning.
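LoRA, the fine-tuning method Hyacinth6B relies on, freezes the pretrained weight matrix and learns only a scaled low-rank correction. A minimal pure-Python sketch of the forward pass (matrix sizes, names, and the naive `matmul` are illustrative, not from the paper):

```python
def matmul(A, B):
    # naive matrix multiply for small illustrative matrices
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_forward(x, W, A, B, alpha, r):
    """Compute y = x @ (W + (alpha/r) * B @ A)^T.
    W is the frozen pretrained weight; only the rank-r factors
    A (r x d_in) and B (d_out x r) are trained."""
    scale = alpha / r
    delta = matmul(B, A)  # low-rank update, same shape as W
    W_eff = [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    W_eff_T = [list(col) for col in zip(*W_eff)]
    return matmul(x, W_eff_T)
```

Because only A and B receive gradients, the number of trainable parameters drops from d_out * d_in to r * (d_in + d_out), which is the source of LoRA's low computational cost.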


AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving

http://arxiv.org/abs/2403.13331v1

Compressor summary: The paper introduces an autoregressive method for motion prediction in autonomous driving using GPT-style next token prediction, factorized attention modules, and position encoding styles to capture spatial-temporal relations.


Efficient scene text image super-resolution with semantic guidance

http://arxiv.org/abs/2403.13330v1

Compressor summary: SGENet is an efficient scene text image super-resolution framework that combines a super-resolution branch with a semantic guidance branch to improve recognition accuracy while keeping computational costs low.


Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion

http://arxiv.org/abs/2403.13327v1

Compressor summary: Our method adapts to handheld video camera motion and improves scene reconstruction using detailed image formation modeling and differentiable rendering with velocities from visual-inertial odometry.


Out-of-Distribution Detection Using Peer-Class Generated by Large Language Model

http://arxiv.org/abs/2403.13324v1

Compressor summary: ODPC uses language models to generate peer classes for OOD detection, improving reliability and security of machine learning models.


DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation

http://arxiv.org/abs/2403.13322v1

Compressor summary: The authors present a comprehensive benchmark for evaluating the adversarial robustness of compressed datasets created by various distillation methods, using different attacks and datasets.


HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling

http://arxiv.org/abs/2403.13319v1

Compressor summary: The paper introduces a novel hypernetwork framework that fuses medical imaging and tabular EHR data for multimodal predictive tasks in healthcare, conditioning image processing on EHR values and measurements and outperforming single-modality models and existing fusion methods on brain MRI tasks.


PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

http://arxiv.org/abs/2403.13315v1

Compressor summary: The paper introduces PuzzleVQA, a dataset to test large multimodal models on abstract patterns, and finds that they struggle with visual perception and inductive reasoning, suggesting limitations in emulating human cognition.


Polaris: A Safety-focused LLM Constellation Architecture for Healthcare

http://arxiv.org/abs/2403.13313v1

Compressor summary: Polaris is a safety-focused system of large language models that can engage in real-time voice conversations with patients, performing better than previous models and human nurses on medical safety and bedside manner.


LeanReasoner: Boosting Complex Logical Reasoning with Lean

http://arxiv.org/abs/2403.13312v1

Compressor summary: The authors use Lean, a theorem proving framework, to improve LLMs' logical reasoning skills by formalizing problems into theorems and solving them with Lean's symbolic solver and library of proofs.


Multi-Robot Connected Fermat Spiral Coverage

http://arxiv.org/abs/2403.13311v1

Compressor summary: The Multi-Robot Connected Fermat Spiral (MCFS) algorithm helps multiple robots navigate around obstacles for efficient area coverage and smooth paths, improving performance in complex environments.


LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment

http://arxiv.org/abs/2403.13307v1

Compressor summary: LaserHuman is a new dataset for generating realistic human motions from natural language descriptions in various 3D environments, with a novel multi-conditional diffusion model that outperforms existing methods.


DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

http://arxiv.org/abs/2403.13304v1

Compressor summary: DetDiffusion is a novel method that combines generative and perceptive models to create high-quality synthetic images for object detection tasks using segmentation and perception-aware attributes.


Rotary Position Embedding for Vision Transformer

http://arxiv.org/abs/2403.13298v1

Compressor summary: RoPE enhances Vision Transformer performance on image tasks, notably maintaining precision when inference resolution increases, while adding negligible computational overhead.
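Rotary position embedding itself is a standard technique: each consecutive pair of feature dimensions is rotated by an angle that grows with token position, so attention dot products depend only on relative position. A minimal 1D sketch (the paper's 2D extension for image patches is not reproduced here):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs of `vec` by a
    position-dependent angle -- the standard 1D rotary embedding."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation frequency decays with dim index
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Rotations preserve vector norms, and the dot product between two rotated vectors depends only on the difference of their positions, which is what makes the encoding relative rather than absolute.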


Building Optimal Neural Architectures using Interpretable Knowledge

http://arxiv.org/abs/2403.13293v1

Compressor summary: AutoBuild is a scheme that learns to assign importance scores to neural architecture modules using latent embeddings, enabling high-quality network construction without costly search.


Text-to-3D Shape Generation

http://arxiv.org/abs/2403.13289v1

Compressor summary: Text-to-3D shape generation is a rapidly advancing field with many challenges, and this report surveys the existing methods and suggests future research directions.


AdaViPro: Region-based Adaptive Visual Prompt for Large-Scale Models Adapting

http://arxiv.org/abs/2403.13282v1

Compressor summary: AdaViPro is a method that learns to add and remove prompts in different regions of an image to fine-tune pre-trained models efficiently.


AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models

http://arxiv.org/abs/2403.13269v1

Compressor summary: AFLoRA is a new fine-tuning method that uses low-rank matrices to adapt pre-trained models with fewer parameters, less computation, and better performance on the GLUE benchmark.


Unifews: Unified Entry-Wise Sparsification for Efficient Graph Neural Network

http://arxiv.org/abs/2403.13268v1

Compressor summary: Unifews is a novel method that jointly sparsifies edges and weights of graph neural networks, reducing computational cost without sacrificing accuracy.


SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

http://arxiv.org/abs/2403.13263v1

Compressor summary: The paper introduces SC-Tune, a novel fine-tuning method for Large Vision Language Models that improves their self-consistency and object-level comprehension abilities.


Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations

http://arxiv.org/abs/2403.13261v1

Compressor summary: The text describes a new method for autonomous driving systems to predict motion using unlabeled LiDAR point clouds and coarse pseudo motion labels, outperforming existing self-supervised methods.


SAMCT: Segment Any CT Allowing Labor-Free Task-Indicator Prompts

http://arxiv.org/abs/2403.13258v1

Compressor summary: The paper introduces SAMCT, a modified segment anything model for medical imaging that improves performance by adding a U-shaped CNN image encoder, cross-branch interaction, and task-indicator prompt encoder to address the lack of medical knowledge and insufficient feature extraction.


Arcee's MergeKit: A Toolkit for Merging Large Language Models

http://arxiv.org/abs/2403.13257v1

Compressor summary: MergeKit is an open-source library that helps merge pre-trained language models to improve performance and versatility without additional training.


Document Author Classification Using Parsed Language Structure

http://arxiv.org/abs/2403.13253v1

Compressor summary: The paper explores using grammatical structure information extracted by a statistical natural language parser to detect authorship of texts, and tests the method on The Federalist Papers and Sanditon.


Facilitating Pornographic Text Detection for Open-Domain Dialogue Systems via Knowledge Distillation of Large Language Models

http://arxiv.org/abs/2403.13250v1

Compressor summary: CensorChat is a dialogue monitoring dataset that uses knowledge distillation of large language models to annotate and develop text classifiers for detecting pornographic content in human-machine interaction dialogues.


A Unified and General Framework for Continual Learning

http://arxiv.org/abs/2403.13249v1

Compressor summary: This paper introduces a unified framework for Continual Learning methods, reveals their common mathematical structures, and proposes refresh learning, an innovative technique inspired by neuroscience to enhance CL performance.


Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

http://arxiv.org/abs/2403.13248v1

Compressor summary: Mora is a new multi-agent framework that mimics Sora's generalist video generation capabilities using several advanced visual AI agents for various tasks.


Divide-Conquer Transformer Learning for Predicting Electric Vehicle Charging Events Using Smart Meter Data

http://arxiv.org/abs/2403.13246v1

Compressor summary: The authors develop a home EV charging prediction method using historical smart meter data, which can help with load scheduling and energy management, and achieve high accuracy.


Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

http://arxiv.org/abs/2403.13244v1

Compressor summary: The text introduces TSMMG, a large language model that generates molecules based on natural language descriptions and performs well across various tasks and styles.


Tackling Noisy Labels with Network Parameter Additive Decomposition

http://arxiv.org/abs/2403.13241v1

Compressor summary: The paper proposes a method to separate clean and noisy data memorization in deep networks using additive parameter decomposition, improving generalization and reducing overfitting.


SumTra: A Differentiable Pipeline for Few-Shot Cross-Lingual Summarization

http://arxiv.org/abs/2403.13240v1

Compressor summary: The paper proposes a summarize-and-translate pipeline for cross-lingual summarization, which leverages existing resources and shows competitive performance with few-shot fine-tuning.


Beyond Skeletons: Integrative Latent Mapping for Coherent 4D Sequence Generation

http://arxiv.org/abs/2403.13238v1

Compressor summary: The paper presents a new framework to generate coherent 4D sequences of 3D shapes with dynamic evolution of shape and color using diffusion models and latent mapping.


Technical Report: Competition Solution For BetterMixture

http://arxiv.org/abs/2403.13233v1

Compressor summary: The paper presents a third-place solution for the BetterMixture challenge, which uses Ke-Data-Juicer to optimize and filter data for large language models.


Diffusion Model for Data-Driven Black-Box Optimization

http://arxiv.org/abs/2403.13219v1

Compressor summary: The paper proposes a method to optimize complex designs using diffusion models, which generate near-optimal solutions based on noisy rewards or human preferences, preserving latent structures and achieving sub-optimality error bounds.


Self-Attention Based Semantic Decomposition in Vector Symbolic Architectures

http://arxiv.org/abs/2403.13218v1

Compressor summary: The paper introduces a new variant of the resonator network that uses self-attention based update rules for faster and better associative memory tasks like pattern recognition, scene decomposition, and object reasoning.


From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards

http://arxiv.org/abs/2403.13213v1

Compressor summary: The paper investigates the effectiveness of safety measures in large language models and shows that safety responses can still encode harmful assumptions, leading to trade-offs between helpfulness and safety.