arxiv compressed, 2024-03-12

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-12, generated by the compressor, my personal LLM-based project.


Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling

http://arxiv.org/abs/2403.06978v1

Compressor summary: APT is a computationally efficient method that injects prompts into attention mechanisms for video action recognition tasks, reducing FLOPs and latency while improving performance.


VideoMamba: State Space Model for Efficient Video Understanding

http://arxiv.org/abs/2403.06977v1

Compressor summary: VideoMamba is a novel video understanding model that adapts Mamba to videos, overcoming limitations of existing models and achieving scalability, efficiency, and robustness in various tasks.


BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

http://arxiv.org/abs/2403.06976v1

Compressor summary: BrushNet is a new model that improves image inpainting by dividing the masked image features and noisy latent into separate branches, leading to better results than existing models.


Memory-based Adapters for Online 3D Scene Perception

http://arxiv.org/abs/2403.06974v1

Compressor summary: The paper introduces a framework to enhance offline 3D scene perception models with online capabilities using adapters that leverage temporal information and memory.


A representation-learning game for classes of prediction tasks

http://arxiv.org/abs/2403.06971v1

Compressor summary: The paper proposes a game-based method for learning dimensionality-reducing representations using prior knowledge of future prediction tasks.


MRL Parsing Without Tears: The Case of Hebrew

http://arxiv.org/abs/2403.06970v1

Compressor summary: The paper proposes a fast and accurate "flipped pipeline" approach to syntactic parsing in morphologically rich languages (MRLs), using Hebrew as a test case; existing MRL parsers are slow and complex, while the new approach improves both speed and accuracy for Hebrew NLP tasks.


Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts

http://arxiv.org/abs/2403.06966v1

Compressor summary: Di-SkilL is an RL method for learning diverse skills using Mixture of Experts and maximum entropy optimization, with energy-based models for handling hard discontinuities and multi-modality.


Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena

http://arxiv.org/abs/2403.06965v1

Compressor summary: The text discusses how Construction Grammar can help explain meaning in language constructions, tests Large Language Models on understanding a specific construction (caused-motion), and proposes a novel pipeline for collecting annotated linguistic data using NLP tools.


The pitfalls of next-token prediction

http://arxiv.org/abs/2403.06963v1

Compressor summary: The paper discusses how autoregressive inference and teacher-forced training are different phases of next-token prediction, and argues that teacher-forcing can fail to learn an accurate predictor in certain tasks.
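The distinction this paper draws is a general one, so here is a minimal toy sketch (illustrative only, not the paper's setup) of why teacher-forced training and autoregressive inference behave differently: under teacher forcing every step sees the true prefix, while at inference one early mistake feeds into all later steps.

```python
# A toy deterministic "model": predicts the next token from the previous one.
# The transition table is illustrative only.
def predict(prev):
    table = {"a": "b", "b": "c", "c": "a"}
    return table.get(prev, "a")

truth = ["a", "b", "b", "c"]  # ground-truth sequence

# Teacher forcing: each step is conditioned on the TRUE prefix, so a wrong
# prediction at one step never propagates into later training inputs.
teacher_forced = [predict(t) for t in truth[:-1]]

# Autoregressive inference: each step consumes the model's OWN last output,
# so a single wrong prediction derails every subsequent step.
rollout = [truth[0]]
for _ in range(len(truth) - 1):
    rollout.append(predict(rollout[-1]))

print(teacher_forced)  # ['b', 'c', 'c'] vs. targets ['b', 'b', 'c']
print(rollout)         # ['a', 'b', 'c', 'a'] -- diverges from truth at step 2
```

Under teacher forcing the model is wrong at exactly one position; rolled out autoregressively, that same error changes the input to every later step, which is the train/inference mismatch the paper examines.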


Explainable Transformer Prototypes for Medical Diagnoses

http://arxiv.org/abs/2403.06961v1

Compressor summary: The authors propose a new self-attention mechanism for medical image diagnosis that provides better visual insights and enhances trust in AI decisions.


Optimizing Latent Graph Representations of Surgical Scenes for Zero-Shot Domain Transfer

http://arxiv.org/abs/2403.06953v1

Compressor summary: The paper evaluates object-centric deep learning methods for improving surgical scene understanding across different medical centers and proposes a new approach (LG-DG) that significantly outperforms existing methods.


SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

http://arxiv.org/abs/2403.06952v1

Compressor summary: SELMA improves text-to-image models by using automatically generated data sets to teach different skills and then merging specialized models for faithful image generation.


DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

http://arxiv.org/abs/2403.06951v1

Compressor summary: DEADiff improves text controllability in text-to-image models by decoupling style and semantics using Q-Formers and non-reconstructive learning.


Advancing Generalizable Remote Physiological Measurement through the Integration of Explicit and Implicit Prior Knowledge

http://arxiv.org/abs/2403.06947v1

Compressor summary: The paper presents a new framework for remote photoplethysmography that uses both explicit and implicit prior knowledge to improve performance across different domains and noise sources.


Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

http://arxiv.org/abs/2403.06946v1

Compressor summary: Our UniMoS framework disentangles CLIP's features and trains them separately, improving unsupervised domain adaptation performance for vision-language models.


Counterfactual Reasoning with Knowledge Graph Embeddings

http://arxiv.org/abs/2403.06936v1

Compressor summary: The paper introduces a new task called CFKGR that links knowledge graph completion and counterfactual reasoning, and proposes COULDD, a method for adapting knowledge graph embeddings to detect plausible changes in hypothetical scenarios while retaining original facts.


Naming, Describing, and Quantifying Visual Objects in Humans and LLMs

http://arxiv.org/abs/2403.06935v1

Compressor summary: The text discusses how humans use various expressions for describing objects in images and explores whether large language models can mimic this feature, finding mixed results.


ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis

http://arxiv.org/abs/2403.06932v1

Compressor summary: ERA-CoT is a new method that helps large language models understand context and reason about multiple entities using Chain-of-Thoughts, leading to improved performance on various natural language processing tasks.


Simplicity Bias of Transformers to Learn Low Sensitivity Functions

http://arxiv.org/abs/2403.06925v1

Compressor summary: This paper explores how transformers have a lower sensitivity bias than other neural network architectures, which leads to better robustness and simplicity across different data modalities.


MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning

http://arxiv.org/abs/2403.06914v1

Compressor summary: MEND is a method to improve the efficiency and effectiveness of in-context learning by distilling demonstrations without retraining, achieving better performance than vanilla ICL and other distillation models.


DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

http://arxiv.org/abs/2403.06912v1

Compressor summary: DNGaussian is a framework that uses depth regularization to improve real-time novel view synthesis from sparse input views with low training costs and fast inference speed.


Responsible Artificial Intelligence: A Structured Literature Review

http://arxiv.org/abs/2403.06910v1

Compressor summary: The paper introduces a unified definition of responsible AI, focusing on ethical, explainable, and privacy-preserving AI methods, to help guide future regulation and development.


FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization

http://arxiv.org/abs/2403.06908v1

Compressor summary: FreGS is a technique that improves 3D Gaussian splatting by regulating image frequencies during Gaussian densification, resulting in better real-time novel view synthesis quality.


Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints

http://arxiv.org/abs/2403.06906v1

Compressor summary: DeCCaF is a new L2D framework that learns human error probabilities with less data and minimizes error costs under workload constraints in cost-sensitive scenarios.


Benign overfitting in leaky ReLU networks with moderate input dimension

http://arxiv.org/abs/2403.06903v1

Compressor summary: The paper investigates benign overfitting in two-layer leaky ReLU networks trained with hinge loss on binary classification tasks and shows that high SNR leads to benign overfitting, while low SNR leads to harmful overfitting, both due to approximate margin maximization.


Deep adaptative spectral zoom for improved remote heart rate estimation

http://arxiv.org/abs/2403.06902v1

Compressor summary: The paper proposes a novel data-driven adaptive Chirp-Z Transform estimator for remote heart rate estimation from rPPG signals, achieving outstanding performance across diverse datasets.


GRITv2: Efficient and Light-weight Social Relation Recognition

http://arxiv.org/abs/2403.06895v1

Compressor summary: The research improves GRIT, a relation recognition model, by introducing new features, creating two versions with different sizes, and applying quantization techniques for efficient deployment on mobile devices.


Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

http://arxiv.org/abs/2403.06892v1

Compressor summary: The paper introduces OmDet-Turbo, a fast and accurate transformer-based model for open-vocabulary object detection with an Efficient Fusion Head module.


A Holistic Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation

http://arxiv.org/abs/2403.06884v1

Compressor summary: The study introduces TrafficDojo, a holistic traffic simulation framework for evaluating vision-based traffic signal control methods that reduce congestion and emissions by using end-to-end learning and optimization of traffic signals.


Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning

http://arxiv.org/abs/2403.06880v1

Compressor summary: The text discusses how reward transitions in reinforcement learning tasks, inspired by toddlers' learning from sparse feedback to dense rewards, affect sample efficiency, success rates, and generalization.


COOD: Combined out-of-distribution detection using multiple measures for anomaly & novel class detection in large-scale hierarchical classification

http://arxiv.org/abs/2403.06874v1

Compressor summary: The paper presents a supervised framework that combines multiple out-of-distribution (OOD) measures into a single COOD measure for anomaly and novel class detection; evaluated on three large-scale hierarchical biodiversity datasets, COOD outperforms individual OOD measures by a large margin and also flags in-distribution images that were misclassified on the original task.


Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal Documents

http://arxiv.org/abs/2403.06872v1

Compressor summary: MESc is a deep-learning framework that uses multi-stage encoder-based supervised clustering to predict legal judgments from large, non-uniform, and unstructured documents, outperforming previous methods by 2 points.


On the Generalization Ability of Unsupervised Pretraining

http://arxiv.org/abs/2403.06871v1

Compressor summary: This paper presents a theoretical framework to understand how unsupervised pre-training affects the generalization of fine-tuned models, and proposes a novel regularization method for better performance.


Semantic Residual Prompts for Continual Learning

http://arxiv.org/abs/2403.06870v1

Compressor summary: The paper proposes a continual learning method that uses CLIP to select prompts and transfer their semantics to ViT layers via a residual mechanism, avoiding the catastrophic forgetting caused by changing prompt-selection keys; it outperforms state-of-the-art continual learning approaches and works well on datasets with a domain gap.


Learning with Noisy Foundation Models

http://arxiv.org/abs/2403.06869v1

Compressor summary: This paper analyzes and mitigates label noise in large-scale pre-training datasets to improve generalization and reduce risks in foundation models.


QUASAR: QUality and Aesthetics Scoring with Advanced Representations

http://arxiv.org/abs/2403.06866v1

Compressor summary: The paper presents a new non-parametric method for evaluating image quality and aesthetics that outperforms existing approaches, requires no additional engineering, and agrees well with human judgments.


Real-Time Simulated Avatar from Head-Mounted Sensors

http://arxiv.org/abs/2403.06862v1

Compressor summary: SimXR is a method that uses information from VR/AR headsets to control a humanoid avatar's movement in simulation, combining headset poses and image analysis.


A Geospatial Approach to Predicting Desert Locust Breeding Grounds in Africa

http://arxiv.org/abs/2403.06860v1

Compressor summary: The study develops a deep learning model for predicting desert locust breeding grounds from spatio-temporal input features; it outperforms existing baselines and shows that multi-spectral earth observation images alone suffice for prediction, without additional environmental or climatic data, which can strengthen early warning systems and control measures.


Development of a Reliable and Accessible Caregiving Language Model (CaLM)

http://arxiv.org/abs/2403.06857v1

Compressor summary: The study develops a Caregiving Language Model (CaLM) using small language models and a caregiving knowledge base, finding it performs better than GPT-3.5 in supporting family caregivers of individuals with Alzheimer's Disease Related Dementias.


Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification

http://arxiv.org/abs/2403.06854v1

Compressor summary: The paper analyses how sensitive inverse reinforcement learning (IRL) is to misspecification of behavioural models and provides conditions for when IRL is robust or not.


DiaLoc: An Iterative Approach to Embodied Dialog Localization

http://arxiv.org/abs/2403.06846v1

Compressor summary: DiaLoc is a dialog-based localization framework that uses multimodal data and iterative refinement to achieve state-of-the-art results in embodied dialog-based localization tasks, both in single-shot and multi-shot settings.


DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

http://arxiv.org/abs/2403.06845v1

Compressor summary: DriveDreamer-2 is a system that uses a large language model to generate customized driving videos with high quality and coherence for enhancing driving perception methods training.


Towards an educational tool for supporting neonatologists in the delivery room

http://arxiv.org/abs/2403.06843v1

Compressor summary: The paper proposes a machine learning approach to identify risk factors for infant resuscitation at birth and aims to develop a mobile app for healthcare personnel to use in the delivery room.


RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback

http://arxiv.org/abs/2403.06840v1

Compressor summary: RA-ISF is a framework that improves large language models' problem-solving by iteratively decomposing tasks and integrating external knowledge, outperforming existing methods when applied to GPT-3.5 and Llama 2.


Stochastic Cortical Self-Reconstruction

http://arxiv.org/abs/2403.06837v1

Compressor summary: The authors propose a new method called stochastic cortical self-reconstruction (SCSR) that creates subject-specific healthy reference ranges for assessing cortical atrophy in neurodegenerative diseases using MRI data and various machine learning models.


Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting

http://arxiv.org/abs/2403.06835v1

Compressor summary: The proposed model generates detailed and accurate synthetic medical images by aligning descriptive text prompts with image features using fine-grained alignment techniques.


Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

http://arxiv.org/abs/2403.06833v1

Compressor summary: The text introduces a formal measure to evaluate the instruction-data separation in LLMs, a new dataset (SEP) to estimate it, and shows that current LLMs lack this separation.


The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework

http://arxiv.org/abs/2403.06832v1

Compressor summary: The paper proposes SNAG, a Transformer-based method that learns robust multi-modal entity features in knowledge graphs using modality-level noise masking and specific training objectives for two tasks: MKGC and MMEA.


HDRTransDC: High Dynamic Range Image Reconstruction with Transformer Deformation Convolution

http://arxiv.org/abs/2403.06831v1

Compressor summary: The HDRTransDC network combines TDCAM and DWFB to generate high-quality HDR images by eliminating ghosting artifacts and fusion distortions in multi-exposure LDR images.


Constructing Variables Using Classifiers as an Aid to Regression: An Empirical Assessment

http://arxiv.org/abs/2403.06829v1

Compressor summary: The paper introduces a method to improve regression by discretizing continuous variables, creating value thresholds, training classifiers, and concatenating outputs into an enriched vector.


In-context Exploration-Exploitation for Reinforcement Learning

http://arxiv.org/abs/2403.06826v1

Compressor summary: The paper introduces ICEE, an efficient algorithm for online policy learning in offline RL that balances exploration and exploitation within a Transformer model without Bayesian inference.


ε-Neural Thompson Sampling of Deep Brain Stimulation for Parkinson Disease Treatment

http://arxiv.org/abs/2403.06814v1

Compressor summary: The text describes a novel contextual multi-armed bandit approach for adaptive deep brain stimulation to treat Parkinson's disease, which improves efficiency and reduces side effects compared to traditional methods.


LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations

http://arxiv.org/abs/2403.06813v1

Compressor summary: LeOCLR is a new framework for contrastive learning of visual representations that improves representation learning by ensuring shared regions between positive pairs are semantically correct, outperforming baseline models on different datasets.


Monotone Individual Fairness

http://arxiv.org/abs/2403.06812v1

Compressor summary: The paper proposes algorithms for online learning with individual fairness that use monotone aggregation functions to collect feedback from multiple auditors, achieving better regret and fairness violations bounds than previous methods.


Multistep Consistency Models

http://arxiv.org/abs/2403.06807v1

Compressor summary: Multistep Consistency Models combine consistency and diffusion models to balance sampling speed and quality, achieving impressive results on image generation tasks.


On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

http://arxiv.org/abs/2403.06806v1

Compressor summary: The paper analyzes the convergence rate of policy gradient for infinite horizon average reward Markov decision processes and shows that it converges at a sublinear rate with finite-time performance guarantees.


Shape Non-rigid Kinematics (SNK): A Zero-Shot Method for Non-Rigid Shape Matching via Unsupervised Functional Map Regularized Reconstruction

http://arxiv.org/abs/2403.06804v1

Compressor summary: Shape Non-rigid Kinematics (SNK) is a novel method for matching non-rigid shapes that doesn't require training or ground truth data, using an encoder-decoder architecture and an unsupervised functional map.


Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection

http://arxiv.org/abs/2403.06803v1

Compressor summary: The authors propose a data-independent operator (DIO) to detect fake images generated by various models, achieving state-of-the-art performance without requiring training or large models.


MambaMIL: Enhancing Long Sequence Modeling with Sequence Reordering in Computational Pathology

http://arxiv.org/abs/2403.06800v1

Compressor summary: MambaMIL uses a sequence model to improve feature extraction and overfitting in computational pathology, outperforming existing Multiple Instance Learning approaches.


Leveraging Internal Representations of Model for Magnetic Image Classification

http://arxiv.org/abs/2403.06797v1

Compressor summary: The paper proposes using deep learning to generate informative samples from sparse data for training autonomous systems securely.


Boosting Image Restoration via Priors from Pre-trained Models

http://arxiv.org/abs/2403.06793v1

Compressor summary: The paper proposes a lightweight module called PTG-RM that uses priors from pre-trained models to improve image restoration tasks such as low-light enhancement, deraining, deblurring, and denoising.


Genetic Learning for Designing Sim-to-Real Data Augmentations

http://arxiv.org/abs/2403.06786v1

Compressor summary: The paper proposes interpretable metrics to predict how well augmentation policies work for sim-to-real object detection tasks and introduces GeneticAugment, a method that uses these metrics to automatically design augmentation policies.


FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

http://arxiv.org/abs/2403.06775v1

Compressor summary: The paper proposes a method to improve text-to-image generation by modeling subjects as derived classes that inherit both public and private attributes from their categories, leading to more realistic and imaginative attribute-related generations.


Redefining Event Types and Group Evolution in Temporal Data

http://arxiv.org/abs/2403.06771v1

Compressor summary: The paper proposes a novel framework for characterizing group dynamics in temporal data using archetypal events defined by facets, enabling richer and more reliable analyses of complex group relationships.


Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

http://arxiv.org/abs/2403.06769v1

Compressor summary: The paper proposes TRIP, a dialogue agent that can tailor its strategic planning for diverse users and perform well on non-collaborative dialogue tasks.


XB-MAML: Learning Expandable Basis Parameters for Effective Meta-Learning with Wide Task Coverage

http://arxiv.org/abs/2403.06768v1

Compressor summary: XB-MAML is a meta-learning method that learns expandable basis parameters to form an effective initialization for diverse unseen tasks by adaptively expanding them based on discrepancy with fine-tuned parameters.


ConspEmoLLM: Conspiracy Theory Detection Using an Emotion-Based Large Language Model

http://arxiv.org/abs/2403.06765v1

Compressor summary: The paper introduces ConspEmoLLM, an open-source natural language processing model that detects and analyzes conspiracy theories by incorporating affective features such as sentiment and emotions.


An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

http://arxiv.org/abs/2403.06764v1

Compressor summary: The study identifies attention inefficiency in large vision-language models and introduces FastV, a method to optimize efficiency and performance in image and video understanding tasks.


Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation

http://arxiv.org/abs/2403.06759v1

Compressor summary: The paper proposes mL1-ACE, a novel loss function to improve medical image segmentation by reducing overconfidence and calibration errors, while maintaining high Dice scores.


EarthLoc: Astronaut Photography Localization by Indexing Earth from Space

http://arxiv.org/abs/2403.06758v1

Compressor summary: The paper proposes EarthLoc, a novel model that uses image retrieval to efficiently and accurately localize astronaut photography for scientific research and disaster response.


Koopman Ensembles for Probabilistic Time Series Forecasting

http://arxiv.org/abs/2403.06757v1

Compressor summary: The paper proposes a new method for training multiple models to predict uncertain outcomes in dynamical systems, such as weather, by encouraging them to disagree with each other.


ALaRM: Align Language Models via Hierarchical Rewards Modeling

http://arxiv.org/abs/2403.06754v1

Compressor summary: ALaRM is a framework that models hierarchical rewards to improve the alignment of large language models with human preferences in complex text generation tasks.


ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine Translation

http://arxiv.org/abs/2403.06745v1

Compressor summary: The paper introduces a novel supervised fine-tuning mechanism for multilingual neural machine translation that improves performance and reduces off-target issues by automatically constructing constrained templates with trigger tokens.


Distribution-Aware Data Expansion with Diffusion Models

http://arxiv.org/abs/2403.06741v1

Compressor summary: DistDiff is a data expansion framework that uses hierarchical prototypes to generate distribution-consistent samples, improving performance of deep models without additional training.


V3D: Video Diffusion Models are Effective 3D Generators

http://arxiv.org/abs/2403.06738v1

Compressor summary: V3D uses pre-trained video diffusion models to create high-quality 3D objects from single images with geometrical consistency and fast generation speed.


Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback

http://arxiv.org/abs/2403.06735v1

Compressor summary: The paper presents a method to improve text-generative models for image captions using Supervised Learning, Reinforcement Learning with Human Feedback, and a novel loss function based on the Flickr8k dataset.


Real-Time Multimodal Cognitive Assistant for Emergency Medical Services

http://arxiv.org/abs/2403.06734v1

Compressor summary: CognitiveEMS is a wearable system that uses speech recognition, graph-based attention, and action recognition to assist EMS responders in real-time during emergencies.


Large Model driven Radiology Report Generation with Clinical Quality Reinforcement Learning

http://arxiv.org/abs/2403.06728v1

Compressor summary: The paper proposes LM-RRG, a novel radiology report generation method that combines large models with clinical quality reinforcement learning to produce accurate and comprehensive chest X-ray reports.


Probabilistic Contrastive Learning for Long-Tailed Visual Recognition

http://arxiv.org/abs/2403.06726v1

Compressor summary: The paper proposes ProCo, a probabilistic contrastive learning algorithm that estimates class distributions using mixture of von Mises-Fisher distributions and samples contrastive pairs accordingly to handle imbalanced data.


Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization

http://arxiv.org/abs/2403.06702v1

Compressor summary: E3-FaceNet is a network that generates and manipulates 3D face models from text instructions with high efficiency and quality, using a direct mapping and novel enhancements.


Advancing Graph Neural Networks with HL-HGAT: A Hodge-Laplacian and Attention Mechanism Approach for Heterogeneous Graph-Structured Data

http://arxiv.org/abs/2403.06687v1

Compressor summary: The HL-HGAT is a novel graph neural network that learns from $k$-simplices using Hodge-Laplacian convolutional filters, simplicial projection, and simplicial attention pooling for various applications.


Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency

http://arxiv.org/abs/2403.06683v1

Compressor summary: The paper explores using deep learning models trained on natural images to infer depth in endoscopic videos, and improves their performance by adding temporal consistency self-supervision.


Restoring Ancient Ideograph: A Multimodal Multitask Neural Network Approach

http://arxiv.org/abs/2403.06682v1

Compressor summary: The paper presents a novel model that uses multimodal deep learning to restore ancient texts, particularly ideographs, by combining context understanding with visual information from damaged artefacts.


Trustworthy Partial Label Learning with Out-of-distribution Detection

http://arxiv.org/abs/2403.06681v1

Compressor summary: PLL-OOD is a novel method for learning from ambiguously labelled data that incorporates Out-of-Distribution detection to enhance model adaptability and accuracy in open-world settings.


Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

http://arxiv.org/abs/2403.06679v1

Compressor summary: The paper proposes a method called mutual correlation distillation (MCD) to improve audio-visual question answering by enhancing soft associations, aligning cross-modal features, and decoupling audio-visual dependencies, leading to better performance on two datasets.


Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance Reduction

http://arxiv.org/abs/2403.06677v1

Compressor summary: The study proposes R-LSVRG and R-PAGE methods for stochastic optimization on Riemannian manifolds, which simplify proofs, hyperparameter selection, and have sharp convergence guarantees, and applies them to non-convex distributed settings with communication compression.


CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective

http://arxiv.org/abs/2403.06676v1

Compressor summary: Large kernel CNNs perform well in downstream tasks like weakly supervised object localization, with feature map improvement being the main factor for their success, and they are robust to CAM problems.


Car Damage Detection and Patch-to-Patch Self-supervised Image Alignment

http://arxiv.org/abs/2403.06674v1

Compressor summary: The text describes an application that uses computer vision to detect car damages and align pre-trip and post-trip images for insurance purposes, using a Mask R-CNN model and a self-supervised SimCLR alignment approach.


CEAT: Continual Expansion and Absorption Transformer for Non-Exemplar Class-Incremental Learning

http://arxiv.org/abs/2403.06670v1

Compressor summary: CEAT is a new architecture that enables models to learn new tasks without forgetting old ones while protecting privacy by extending and absorbing layers and using prototype contrastive loss and pseudo-features.


PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor

http://arxiv.org/abs/2403.06668v1

Compressor summary: PeerAiD uses a peer network to defend against adversarial examples targeting a student network, improving its robustness in security-critical domains.


Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

http://arxiv.org/abs/2403.06658v1

Compressor summary: The paper proposes a biometric recognition framework that uses synthetic data and local registration to address data demands, domain generalization, and interpretability issues.


Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

http://arxiv.org/abs/2403.06644v1

Compressor summary: The authors investigate how Large Language Models (LLMs) handle tabular data, finding that they may be contaminated or memorize the data, which can affect their performance on downstream tasks.


Spatial features of CO2 for occupancy detection in a naturally ventilated school building

http://arxiv.org/abs/2403.06643v1

Compressor summary: The study proposes two new features for occupancy detection based on CO2 concentration spatial distribution, improving accuracy and quantity estimation in naturally ventilated buildings without or with ventilation information.


Evaluating the Energy Efficiency of Few-Shot Learning for Object Detection in Industrial Settings

http://arxiv.org/abs/2403.06631v1

Compressor summary: The paper explores finetuning object detection models for few-shot learning, evaluates energy demands in industrial settings, and proposes an Efficiency Factor metric to measure the trade-off between performance and efficiency.


Density-Guided Label Smoothing for Temporal Localization of Driving Actions

http://arxiv.org/abs/2403.06616v1

Compressor summary: The text describes a method for improving the accuracy and efficiency of localizing driving actions in videos using density-guided label smoothing and post-processing techniques, achieving competitive results in a naturalistic driving action recognition challenge.
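Label smoothing itself is a simple operation; below is a minimal sketch of the density-guided idea, where the smoothing strength is scaled by a per-sample density weight (the function name and the linear weighting rule are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def density_guided_smooth(one_hot, density, eps_max=0.2):
    """Blend one-hot labels toward uniform, with the smoothing strength
    scaled by a per-sample density weight in [0, 1].

    one_hot: (n_samples, n_classes) hard labels
    density: (n_samples,) weights, e.g. temporal proximity to an action
             boundary (1 = ambiguous frame, 0 = confident frame)
    """
    n_classes = one_hot.shape[1]
    eps = eps_max * density[:, None]               # per-sample smoothing amount
    uniform = np.full_like(one_hot, 1.0 / n_classes, dtype=float)
    return (1.0 - eps) * one_hot + eps * uniform

labels = np.eye(3)[[0, 1]]            # two hard labels over 3 classes
density = np.array([0.0, 1.0])        # second sample sits near a boundary
smoothed = density_guided_smooth(labels, density)
```

Frames far from any action boundary keep their hard labels, while ambiguous boundary frames get softer targets.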


MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding

http://arxiv.org/abs/2403.06611v1

Compressor summary: The MedKP framework improves large language models' performance in medical dialogue generation by incorporating external knowledge from a medical knowledge graph and internal clinical pathway encoding via entities and actions.


Guiding Clinical Reasoning with Large Language Models via Knowledge Seeds

http://arxiv.org/abs/2403.06609v1

Compressor summary: In-Context Padding (ICP) is a novel framework that enhances large language models' clinical reasoning by guiding them with critical medical knowledge elements called knowledge seeds.


Distributionally Generative Augmentation for Fair Facial Attribute Classification

http://arxiv.org/abs/2403.06606v1

Compressor summary: Our method edits and augments images to remove bias in facial attribute classification without needing extra labels.


Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers

http://arxiv.org/abs/2403.06601v1

Compressor summary: The paper proposes cross-domain and cross-dimension transfer learning methods for image-to-graph transformers, including an edge sampling loss, a domain adaptation framework, and a projection function, improving object detection and relationship prediction on benchmarks such as retinal and whole-brain vessel graph extraction.


BEV2PR: BEV-Enhanced Visual Place Recognition with Structural Cues

http://arxiv.org/abs/2403.06600v1

Compressor summary: The paper proposes BEV2PR, a VPR framework that uses bird's-eye view segmentation features and a single monocular camera to generate composite descriptors with visual cues and spatial awareness, improving performance over existing camera-based methods.


Exploiting Style Latent Flows for Generalizing Deepfake Video Detection

http://arxiv.org/abs/2403.06592v1

Compressor summary: The paper proposes a new method for detecting fake videos by analyzing how the style latent vectors change over time in generated videos, which reveals abnormal patterns that indicate manipulation.


Academically intelligent LLMs are not necessarily socially intelligent

http://arxiv.org/abs/2403.06591v1

Compressor summary: The text introduces a new test (SESI) to evaluate the social intelligence of large language models (LLMs), which shows they have room for improvement and social intelligence is distinct from academic intelligence.


ContextGPT: Infusing LLMs Knowledge into Neuro-Symbolic Activity Recognition Models

http://arxiv.org/abs/2403.06586v1

Compressor summary: ContextGPT uses prompt engineering to retrieve common-sense knowledge from Large Language Models for context-aware Human Activity Recognition, requiring less human effort and expertise than ontologies.


Transformer-based Fusion of 2D-pose and Spatio-temporal Embeddings for Distracted Driver Action Recognition

http://arxiv.org/abs/2403.06577v1

Compressor summary: The study adapts a transformer-based fusion architecture to improve temporal localization and classification accuracy in driver-assistance systems using video action recognition and 2D human-pose estimation.


FFAD: A Novel Metric for Assessing Generated Time Series Data Utilizing Fourier Transform and Auto-encoder

http://arxiv.org/abs/2403.06576v1

Compressor summary: The paper introduces FFAD, a new metric for evaluating quality of synthetic time series data using the Fourier transform and an auto-encoder.
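The metric's two ingredients can be sketched generically: map each series to Fourier features, then compare the feature distributions of real and generated data with a Fréchet distance. The sketch below skips the paper's auto-encoder embedding and assumes diagonal covariances for brevity, so it illustrates the idea rather than the FFAD definition:

```python
import numpy as np

def fourier_features(x):
    """Magnitude spectrum of each series (one series per row of x)."""
    return np.abs(np.fft.rfft(x, axis=1))

def frechet_diag(a, b):
    """Frechet distance between Gaussian fits of two feature sets,
    simplified by assuming diagonal covariances."""
    mu1, mu2 = a.mean(axis=0), b.mean(axis=0)
    v1, v2 = a.var(axis=0), b.var(axis=0)
    return float(((mu1 - mu2) ** 2).sum()
                 + (v1 + v2 - 2.0 * np.sqrt(v1 * v2)).sum())

rng = np.random.default_rng(0)
real = rng.normal(size=(64, 128))      # stand-ins for real time series
fake = rng.normal(size=(64, 128))      # stand-ins for generated series
score = frechet_diag(fourier_features(real), fourier_features(fake))
```

Identical feature sets score zero; larger scores indicate a bigger gap between the spectral statistics of real and generated data.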


AC-EVAL: Evaluating Ancient Chinese Language Understanding in Large Language Models

http://arxiv.org/abs/2403.06574v1

Compressor summary: AC-EVAL is an innovative benchmark to evaluate Large Language Models' understanding of ancient Chinese across three levels of difficulty and 13 tasks, aiming to improve their performance in education and research.


Scalable Online Exploration via Coverability

http://arxiv.org/abs/2403.06571v1

Compressor summary: The text introduces $L_1$-Coverage, a new exploration objective in reinforcement learning that balances intrinsic complexity control, efficient planning, and efficient exploration for high-dimensional domains.


Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

http://arxiv.org/abs/2403.06570v1

Compressor summary: The paper presents a new pipeline for optimizing speaker assignment in real-life meeting transcription using VAD, SD, and SA-ASR, and demonstrates improvements by fine-tuning the SA-ASR model and extracting speaker embedding templates from SD output.


Enhancing Joint Motion Prediction for Individuals with Limb Loss Through Model Reprogramming

http://arxiv.org/abs/2403.06569v1

Compressor summary: The authors use deep learning to adapt models trained on able-bodied data to predict joint motion for amputee patients, potentially improving assistive technologies.


Better Understandings and Configurations in MaxSAT Local Search Solvers via Anytime Performance Analysis

http://arxiv.org/abs/2403.06568v1

Compressor summary: This paper proposes a new way to compare MaxSAT solvers using Empirical Cumulative Distribution Functions, which can help optimize their parameters and show differences in their performance across different time budgets.


Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology

http://arxiv.org/abs/2403.06567v1

Compressor summary: The authors propose using vision foundation models as feature extractors for content-based medical image retrieval and show that weakly-supervised models achieve competitive performance without fine-tuning.


Unraveling the Mystery of Scaling Laws: Part I

http://arxiv.org/abs/2403.06563v1

Compressor summary: This report provides a detailed analysis of scaling law principles for large language models, deriving precise formulas to predict various attributes such as test loss, training steps, and batch size for models up to 33 billion parameters.
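The basic mechanics of fitting a scaling law can be shown on toy data: losses following a power law in model size become linear in log-log space, so the exponent and coefficient fall out of a linear regression. This is a generic illustration of the fitting technique, not the report's actual formulas or constants:

```python
import numpy as np

# Synthetic losses following L(N) = a * N^(-b); recover a and b by
# linear regression in log-log space, where the relation is linear:
#   log L = log a - b * log N
a_true, b_true = 10.0, 0.3
N = np.array([1e6, 1e7, 1e8, 1e9])      # model sizes (parameters)
L = a_true * N ** (-b_true)             # corresponding losses

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a_fit = np.exp(intercept)
b_fit = -slope
```

Real scaling-law fits add an irreducible-loss offset and must contend with noise, but the log-space regression above is the core of the procedure.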


Sliced-Wasserstein Distances and Flows on Cartan-Hadamard Manifolds

http://arxiv.org/abs/2403.06560v1

Compressor summary: This paper presents a method to compute Sliced-Wasserstein distance on certain manifolds and applies it to various problems with non-Euclidean data.


OMH: Structured Sparsity via Optimally Matched Hierarchy for Unsupervised Semantic Segmentation

http://arxiv.org/abs/2403.06546v1

Compressor summary: OMH is a novel approach for unsupervised semantic segmentation that uses structured sparsity and optimal transport to learn a hierarchy among parallel clusters, improving performance over existing methods.


On the Consideration of AI Openness: Can Good Intent Be Abused?

http://arxiv.org/abs/2403.06537v1

Compressor summary: The authors demonstrate how an open-source language model can be adapted to provide unethical answers about criminal activities using a new dataset, EVE, highlighting the need for caution with open technologies.


Multi-Scale Implicit Transformer with Re-parameterize for Arbitrary-Scale Super-Resolution

http://arxiv.org/abs/2403.06536v1

Compressor summary: The Multi-Scale Implicit Transformer (MSIT) uses a novel approach to improve the performance of arbitrary-scale super-resolution by exploiting multi-scale characteristics and enhancing latent codes with self-attention and re-interaction modules.


Decentralized and Lifelong-Adaptive Multi-Agent Collaborative Learning

http://arxiv.org/abs/2403.06535v1

Compressor summary: DeLAMA is a new algorithm that helps multiple agents collaborate efficiently without a central server by learning graph structures, using memory to store knowledge, and applying optimization and neural networks.


SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

http://arxiv.org/abs/2403.06534v1

Compressor summary: The authors create a large-scale, diverse SAR object detection dataset (SARDet-100K) and propose a novel pretraining framework (MSFA) to improve the performance of SAR object detection models.


Confidence-Aware RGB-D Face Recognition via Virtual Depth Synthesis

http://arxiv.org/abs/2403.06529v1

Compressor summary: The authors propose a domain-independent pre-training framework for RGB-D face recognition that uses depth models from 3D Morphable Models and an Adaptive Confidence Weighting mechanism to fuse RGB and depth information, achieving state-of-the-art performance.


Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward

http://arxiv.org/abs/2403.06524v1

Compressor summary: The paper presents a deep reinforcement learning framework for autonomous trucks to make tactical decisions in ACC and lane change maneuvers, and explores various methods to optimize performance with a multi-objective reward function based on TCOP.


How to Understand Named Entities: Using Common Sense for News Captioning

http://arxiv.org/abs/2403.06520v1

Compressor summary: The paper proposes a commonsense-based approach for news captioning that distinguishes similar entities and enriches their descriptions with relevant information.


Active Generation for Image Classification

http://arxiv.org/abs/2403.06517v1

Compressor summary: The paper proposes ActGen, an active learning approach for image generation, which uses real images as guides and generates challenging samples to improve image classification accuracy efficiently.


Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning

http://arxiv.org/abs/2403.06516v1

Compressor summary: The paper proposes a reinforcement learning framework that uses comparative feedback to generate realistic chest X-rays from diagnostic reports.


Structure Your Data: Towards Semantic Graph Counterfactuals

http://arxiv.org/abs/2403.06514v1

Compressor summary: The paper proposes a method to generate more descriptive and accurate explanations for model predictions using semantic graphs, outperforming previous models in both quantitative and qualitative evaluations.


Skeleton Supervised Airway Segmentation

http://arxiv.org/abs/2403.06510v1

Compressor summary: The text introduces a new skeleton-level annotation method (SkA) and a learning framework that achieve accurate airway segmentation with less annotation effort while improving consistency and accuracy.


Vosh: Voxel-Mesh Hybrid Representation for Real-Time View Synthesis

http://arxiv.org/abs/2403.06505v1

Compressor summary: Vosh is a hybrid representation of NeRF that combines voxels and mesh for fast and high-quality image synthesis with adjustable balance.


3D Semantic Segmentation-Driven Representations for 3D Object Detection

http://arxiv.org/abs/2403.06501v1

Compressor summary: The paper introduces SeSame, a new way to represent 3D object detection data that combines semantic and geometric features, improving accuracy in autonomous driving.


QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning

http://arxiv.org/abs/2403.06497v1

Compressor summary: QuantTune is a method that fine-tunes transformer-based models for better post-training linear quantization and reduces accuracy drops caused by precision loss due to outliers.


Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts

http://arxiv.org/abs/2403.06495v1

Compressor summary: The paper proposes a novel method, InCTRL, that uses few-shot normal images as prompts to train a generalist anomaly detection model on diverse datasets without extra training data.


Graph Neural Network with Two Uplift Estimators for Label-Scarcity Individual Uplift Modeling

http://arxiv.org/abs/2403.06489v1

Compressor summary: The paper proposes GNUM, a graph neural network-based framework with two uplift estimators to learn from social graphs for uplift estimation in randomized experiments or observational data.


Query-guided Prototype Evolution Network for Few-Shot Segmentation

http://arxiv.org/abs/2403.06488v1

Compressor summary: QPENet uses query features to create custom foreground and background prototypes for few-shot segmentation, improving performance on PASCAL-$5^i$ and COCO-$20^i$ datasets.


Multilingual Turn-taking Prediction Using Voice Activity Projection

http://arxiv.org/abs/2403.06487v1

Compressor summary: The paper explores using a voice activity projection model to predict turn-taking in multilingual dialogue and shows that a multilingual model outperforms monolingual ones, while also analyzing the role of pitch and audio encoders.


The negation of permutation mass function

http://arxiv.org/abs/2403.06483v1

Compressor summary: This paper proposes a negation method for random permutation sets theory, studies its convergence, and shows its effects on uncertainty and dissimilarity using numerical examples.
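For intuition, the classical negation of a discrete probability distribution (each mass replaced by the normalized complement of itself) shows the convergence behavior the paper studies; the paper extends this kind of operation to permutation mass functions, whose exact formula differs from this simple sketch:

```python
import numpy as np

def negate(p):
    """Classical negation of a discrete distribution: each mass becomes
    the normalized complement of itself, (1 - p_i) / (n - 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - p) / (len(p) - 1)

p = np.array([0.7, 0.2, 0.1])
q = negate(p)     # still a distribution; repeated negation converges
```

Iterating the negation drives any starting distribution toward the uniform one, i.e. toward maximum uncertainty, which is the convergence property studied in this line of work.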


Ada-Tracker: Soft Tissue Tracking via Inter-Frame and Adaptive-Template Matching

http://arxiv.org/abs/2403.06479v1

Compressor summary: The paper introduces Ada-Tracker, a method for tracking soft tissues in computer-assisted surgeries using optical flow to capture deformations and adaptively correct the template.


Toward Robust Canine Cardiac Diagnosis: Deep Prototype Alignment Network-Based Few-Shot Segmentation in Veterinary Medicine

http://arxiv.org/abs/2403.06471v1

Compressor summary: This study presents DPANet, a few-shot segmentation method for accurately segmenting the heart and left atrial enlargement on canine chest radiographs, setting a new benchmark in veterinary AI research.


3D-aware Image Generation and Editing with Multi-modal Conditions

http://arxiv.org/abs/2403.06470v1

Compressor summary: The paper presents a 3D-aware image generation and editing model with multiple conditional inputs that uses a novel disentanglement strategy to separate shape and appearance in the GAN latent space, enabling diverse image synthesis, text-driven attribute editing, and style transfer with a reference image.


Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy

http://arxiv.org/abs/2403.06467v1

Compressor summary: Point Mamba is a state space model (SSM)-based point cloud processing backbone whose octree-based ordering strategy preserves the causal dependency and spatial proximity of points, achieving state-of-the-art performance with linear complexity on two benchmark datasets and outperforming transformer-based methods.


RL-MSA: a Reinforcement Learning-based Multi-line bus Scheduling Approach

http://arxiv.org/abs/2403.06466v1

Compressor summary: The paper proposes a Reinforcement Learning-based Multi-line bus Scheduling Approach (RL-MSA) that handles uncertain events like traffic congestion and considers the interests of both the bus company and passengers.


Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation

http://arxiv.org/abs/2403.06462v1

Compressor summary: The paper proposes a new semi-supervised semantic segmentation method called Density-Descending Feature Perturbation (DDFP) that improves feature density estimation and exploration for better segmentation performance.


Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation

http://arxiv.org/abs/2403.06461v1

Compressor summary: Latte is a multi-modal test-time adaptation method for 3D segmentation that uses reliable cross-modal spatial-temporal correspondences and temporal local prediction consistency to adapt models to unlabeled target domains.


Prediction of Wort Density with LSTM Network

http://arxiv.org/abs/2403.06458v1

Compressor summary: The article presents a system using a neural network (LSTM) to estimate wort density for beer production from cheaper sensor data like pressure or temperature.


Ensemble Quadratic Assignment Network for Graph Matching

http://arxiv.org/abs/2403.06457v1

Compressor summary: The paper proposes a graph neural network approach that combines data-driven and traditional graph-matching methods, improving performance and reducing computational complexity.


FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications

http://arxiv.org/abs/2403.06453v1

Compressor summary: FontCLIP is a model that combines vision-language understanding with typography knowledge to find fonts across languages and attributes, even for unseen data.


Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation

http://arxiv.org/abs/2403.06452v1

Compressor summary: Text2QR is a method that uses stable-diffusion models to create visually appealing QR codes while maintaining scanning robustness.


Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models

http://arxiv.org/abs/2403.06448v1

Compressor summary: The paper introduces MIND, an unsupervised framework for real-time hallucination detection in large language models, and HELM, a new benchmark for evaluating such detection.


Latent Semantic Consensus For Deterministic Geometric Model Fitting

http://arxiv.org/abs/2403.06444v1

Compressor summary: The paper introduces Latent Semantic Consensus (LSC), a method to fit geometric models to noisy data by preserving latent semantic spaces, and shows its effectiveness and efficiency in computer vision tasks.


Temporal-Mapping Photography for Event Cameras

http://arxiv.org/abs/2403.06443v1

Compressor summary: Event-Based Temporal Mapping Photography (EvTemMap) converts events from a stationary event camera into dense intensity images using temporal mapping neural networks, achieving high dynamic range and fine-grained details in static scenes.


Fine-Grained Pillar Feature Encoding Via Spatio-Temporal Virtual Grid for 3D Object Detection

http://arxiv.org/abs/2403.06433v1

Compressor summary: The paper proposes Fine-Grained Pillar Feature Encoding (FG-PFE), which uses Spatio-Temporal Virtual grids to capture LiDAR point distributions within pillar structures and improve 3D object detection for autonomous vehicles.


Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain

http://arxiv.org/abs/2403.06432v1

Compressor summary: ST-JEMA is a self-supervised generative method that learns high-level semantic representations of dynamic functional connectivity from fMRI data by reconstructing dynamic graphs, outperforming previous methods in phenotype and diagnosis prediction while handling missing-data scenarios.


AS-FIBA: Adaptive Selective Frequency-Injection for Backdoor Attack on Deep Face Restoration

http://arxiv.org/abs/2403.06430v1

Compressor summary: The paper proposes a new backdoor attack on face restoration models using subtle frequency domain triggers, making the attacks imperceptible but still effective.


A Differential Geometric View and Explainability of GNN on Evolving Graphs

http://arxiv.org/abs/2403.06425v1

Compressor summary: The paper proposes a smooth and interpretable way to model how Graph Neural Networks (GNN) predict distributions on evolving graphs using differential geometry.


A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos

http://arxiv.org/abs/2403.06421v1

Compressor summary: The text discusses the need for better performance evaluation of talking head generation techniques using psychophysical experiments and human validation.


Enhanced Sparsification via Stimulative Training

http://arxiv.org/abs/2403.06417v1

Compressor summary: Sparsification-based pruning improves model compression by enhancing the expressivity of kept weights and maintaining the magnitude of dropped weights, achieving superior performance without fine-tuning under aggressive pruning scenarios.


Evolving Knowledge Distillation with Large Language Models and Active Learning

http://arxiv.org/abs/2403.06414v1

Compressor summary: EvoKD is a method that uses active learning and feedback to generate diverse and challenging data for distilling knowledge from large language models to small ones, improving their performance on various NLP tasks.


CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean

http://arxiv.org/abs/2403.06412v1

Compressor summary: CLIcK is a new benchmark dataset for testing Korean language models' cultural and linguistic knowledge using questions from official exams and textbooks.


A Logical Pattern Memory Pre-trained Model for Entailment Tree Generation

http://arxiv.org/abs/2403.06410v1

Compressor summary: The paper proposes LMPM, a pre-trained AI model that uses external memory and entity abstraction to generate entailment trees with logical consistency and improved credibility.


What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation

http://arxiv.org/abs/2403.06408v1

Compressor summary: The paper studies quantization of large language models through the lens of perturbation, analyzing how artificial perturbations degrade performance and suggesting that non-uniform quantization can improve efficiency without sacrificing accuracy.
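A common reason uniform quantization struggles with LLM weights is that a few outlier values stretch the quantization grid, and a non-uniform scheme that treats outliers separately avoids this. The sketch below demonstrates the effect on a toy tensor; the split-the-outliers scheme is a generic illustration, not the paper's method:

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Uniform round-to-nearest quantization over the tensor's full range."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    step = (hi - lo) / levels
    return np.round((x - lo) / step) * step + lo

# A weight-like tensor with one large outlier: the outlier stretches the
# uniform grid, wrecking precision for the many small values.
rng = np.random.default_rng(0)
w = np.append(rng.normal(scale=0.1, size=999), 8.0)

err_uniform = np.abs(quantize_uniform(w) - w).mean()

# Non-uniform handling: quantize the dense body, keep the outlier exact.
body, outlier = w[:-1], w[-1:]
err_split = np.abs(np.append(quantize_uniform(body), outlier) - w).mean()
```

The split version's grid covers only the narrow body of the distribution, so its average error is far smaller at the same bit width.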


Can LLMs' Tuning Methods Work in Medical Multimodal Domain?

http://arxiv.org/abs/2403.06407v1

Compressor summary: This paper explores how to efficiently fine-tune large language models for specific tasks in the medical domain by comparing different methods and optimizing training costs.


Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

http://arxiv.org/abs/2403.06406v1

Compressor summary: The paper proposes a new method to compare no-reference image quality assessment (NR-IQA) models using analysis-by-synthesis framework and psychophysical testing, which provides better insights than conventional correlation-based metrics.


PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models

http://arxiv.org/abs/2403.06403v1

Compressor summary: PointSeg is a training-free method that uses off-the-shelf vision foundation models to segment 3D scenes accurately by constructing 3D point-box prompts pairs and applying iterative post-refinement and merging algorithms.


'One size doesn't fit all': Learning how many Examples to use for In-Context Learning for Improved Text Classification

http://arxiv.org/abs/2403.06402v1

Compressor summary: The paper proposes adaptive in-context learning, where the number of demonstration examples supplied to a generative model varies with the similarity between the test input and training instances, improving text classification performance.
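One simple way to make the example count adaptive is to map similarity to a demonstration budget: inputs close to the training data get few examples, dissimilar inputs get more. The linear mapping and thresholds below are illustrative assumptions, not the paper's selection rule:

```python
import numpy as np

def choose_k(test_vec, train_vecs, k_min=1, k_max=8):
    """Pick how many in-context examples to use for one test input:
    the lower its max cosine similarity to the training set, the more
    examples we include. Illustrative policy, not the paper's rule."""
    sims = train_vecs @ test_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec))
    s = float(sims.max())                 # in [-1, 1]
    frac = 1.0 - (s + 1.0) / 2.0          # high similarity -> small k
    return int(round(k_min + frac * (k_max - k_min)))

train = np.array([[1.0, 0.0], [0.0, 1.0]])     # toy training embeddings
k_near = choose_k(np.array([1.0, 0.05]), train)   # similar input -> few shots
k_far = choose_k(np.array([-1.0, -1.0]), train)   # dissimilar -> many shots
```

In practice the vectors would be sentence embeddings, and the chosen k examples would then be the nearest neighbors prepended to the prompt.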


Refining Segmentation On-the-Fly: An Interactive Framework for Point Cloud Semantic Segmentation

http://arxiv.org/abs/2403.06401v1

Compressor summary: This paper introduces InterPCSeg, a framework that allows users to improve point cloud semantic segmentation by providing corrective clicks, without needing offline re-training or specialized networks.


DivCon: Divide and Conquer for Progressive Text-to-Image Generation

http://arxiv.org/abs/2403.06400v1

Compressor summary: The paper introduces a method to improve image generation from text by dividing the task into simpler subtasks and using layout information to guide the process.


GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing

http://arxiv.org/abs/2403.06399v1

Compressor summary: The authors create a large IGT corpus and use a pretrained multilingual model to generate IGT for various languages, improving performance on unsegmented text and small corpora.


On the Diminishing Returns of Width for Continual Learning

http://arxiv.org/abs/2403.06398v1

Compressor summary: The text discusses how the width of neural networks affects their ability to avoid catastrophic forgetting when learning new tasks sequentially, and presents a theoretical framework to analyze this relationship.


DeepSafeMPC: Deep Learning-Based Model Predictive Control for Safe Multi-Agent Reinforcement Learning

http://arxiv.org/abs/2403.06397v1

Compressor summary: DeepSafeMPC is a novel method that uses centralized deep learning to predict environmental dynamics and apply Model Predictive Control to ensure safety in multi-agent reinforcement learning.


FSViewFusion: Few-Shots View Generation of Novel Objects

http://arxiv.org/abs/2403.06394v1

Compressor summary: The authors explore how a diffusion model called Dreambooth can synthesize views of novel objects without 3D priors, and introduce a method to transfer view knowledge from one object to another using low rank adapters.


Towards Robust Out-of-Distribution Generalization Bounds via Sharpness

http://arxiv.org/abs/2403.06392v1

Compressor summary: The paper explores how sharpness of learned minima affects out-of-distribution generalization and provides a tighter bound by considering robustness.


Pre-Trained Model Recommendation for Downstream Fine-tuning

http://arxiv.org/abs/2403.06382v1

Compressor summary: The paper presents Fennec, a framework for model selection in transfer learning that uses a large vision model to infer a new task's representation in a transfer-related subspace, where distances represent transferability, and archi2vec to encode models' structures.


Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

http://arxiv.org/abs/2403.06381v1

Compressor summary: The paper proposes attention regulation, a method to improve semantic fidelity in text-to-image synthesis by adjusting cross-attention layers during inference time without additional training or fine-tuning.


Eliminating Warping Shakes for Unsupervised Online Video Stitching

http://arxiv.org/abs/2403.06378v1

Compressor summary: The paper proposes StabStitch, a method that simultaneously performs video stitching and stabilization using unsupervised learning to reduce warping shakes and improve visual experience.


FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization

http://arxiv.org/abs/2403.06375v1

Compressor summary: The paper proposes FlowVQTalker, a method to create realistic talking faces with emotion-aware textures and lip synchronization using normalizing flows and vector quantization models.


FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables

http://arxiv.org/abs/2403.06367v1

Compressor summary: FEATAUG is a new feature augmentation framework that automatically generates predicate-aware SQL queries for one-to-many relationship tables, outperforming Featuretools and other baselines in effectiveness.


Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach

http://arxiv.org/abs/2403.06366v1

Compressor summary: This paper analyzes two types of soft Q-learning algorithms using switching system models and derives novel finite-time error bounds, contributing to a better understanding of soft Q-learning.
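The defining feature of soft Q-learning is easy to show concretely: the bootstrap target uses a temperature-weighted log-sum-exp ("soft max") over next-state values instead of a hard max. A minimal tabular update, as a generic sketch rather than either of the specific variants the paper analyzes:

```python
import numpy as np

def soft_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, tau=1.0):
    """One tabular soft Q-learning step. As tau -> 0 the log-sum-exp
    target approaches the hard max of standard Q-learning."""
    soft_value = tau * np.log(np.sum(np.exp(Q[s_next] / tau)))
    target = r + gamma * soft_value
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 3))                 # 2 states, 3 actions
Q = soft_q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

With all-zero values, the soft value of the next state is tau * log(n_actions), so the update already differs from hard Q-learning (whose target would be just r).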


Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

http://arxiv.org/abs/2403.06365v1

Compressor summary: Style2Talker is a new method to generate expressive talking head videos from audio by using text-controlled emotion and picture-controlled art styles, which outperforms existing methods in lip sync and style control.


Say Anything with Any Style

http://arxiv.org/abs/2403.06363v1

Compressor summary: SAAS is a novel method for generating natural-looking talking head videos with diverse head motions and styles by using a multi-task VQ-VAE and a residual architecture.


See Through Their Minds: Learning Transferable Neural Representation from Cross-Subject fMRI

http://arxiv.org/abs/2403.06361v1

Compressor summary: Shallow subject-specific adapters help decode cross-subject fMRI data into unified representations, improving neural representation learning and reconstruction for both high-level and low-level perceptions.


Human and Automatic Interpretation of Romanian Noun Compounds

http://arxiv.org/abs/2403.06360v1

Compressor summary: The text discusses challenges in understanding noun compounds' meanings in NLP, proposes new relations for Romanian compounds, and tests them with humans and neural networks, finding that agreement tracks with frequency but no existing relation fully captures the meanings.


Video Generation with Consistency Tuning

http://arxiv.org/abs/2403.06356v1

Compressor summary: The paper presents a framework of four modules for tuning, fusing, and enforcing consistency that generates high-quality long videos with consistent background and foreground, free of jitter and noise, outperforming existing approaches in video quality.


Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

http://arxiv.org/abs/2403.06355v1

Compressor summary: The paper presents a new method for multi-modal semantic understanding that aligns image and text features using CLIP-guided contrastive learning, achieving better results on sarcasm detection and sentiment analysis tasks.


Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages

http://arxiv.org/abs/2403.06354v1

Compressor summary: The authors train an Amharic language model on limited data using data augmentation, translation models, and multimodal learning, and release their methods and resources.


Exploring Hardware Friendly Bottleneck Architecture in CNN for Embedded Computing Systems

http://arxiv.org/abs/2403.06352v1

Compressor summary: The paper presents a lightweight CNN architecture, L-Mobilenet, that adapts well to embedded systems and achieves better performance than existing models with fewer parameters and less delay.


Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

http://arxiv.org/abs/2403.06351v1

Compressor summary: The paper presents a method called Exo2Ego that converts third-person videos to first-person views, using a two-stage approach, and introduces a benchmark for evaluating this task.


IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

http://arxiv.org/abs/2403.06350v1

Compressor summary: The text introduces a suite of resources for developing Indic language LLMs, covering 22 languages, using curated, valuable, and synthetic data from various sources, and addressing toxicity alignment.


MOAB: Multi-Modal Outer Arithmetic Block For Fusion Of Histopathological Images And Genetic Data For Brain Tumor Grading

http://arxiv.org/abs/2403.06349v1

Compressor summary: The paper proposes a novel method to combine histological images and genetic data for computer-aided diagnosis of brain tumor grades, using a Multi-modal Outer Arithmetic Block (MOAB) that applies arithmetic operations to latent representations of different modalities.