arxiv compressed, 2024-03-18

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-18, generated by the compressor, my personal LLM-based project.


P-MapNet: Far-seeing Map Generator Enhanced by both SDMap and HDMap Priors

http://arxiv.org/abs/2403.10521v1

Compressor summary: P-MapNet uses map priors from both SDMap and HDMap to improve online map generation for autonomous vehicles in regions without high-definition maps.


Strong and Controllable Blind Image Decomposition

http://arxiv.org/abs/2403.10520v1

Compressor summary: The controllable blind image decomposition network allows users to choose which degradations to remove or keep in an image, using a minimal amount of computational resources.


Frozen Feature Augmentation for Few-Shot Image Classification

http://arxiv.org/abs/2403.10519v1

Compressor summary: The paper explores applying data augmentations to frozen features in few-shot image classification tasks, finding that simple pointwise FroFA improves performance consistently across different settings.


Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives

http://arxiv.org/abs/2403.10518v1

Compressor summary: Lodge is a network that generates long dance routines based on music using two diffusion stages and foot refinement, achieving physical realism and expressive motions.


VideoAgent: Long-form Video Understanding with Large Language Model as Agent

http://arxiv.org/abs/2403.10517v1

Compressor summary: VideoAgent is a system that uses an agent-based approach to reason interactively and plan for long-form video understanding, outperforming existing methods on two benchmarks.


FeatUp: A Model-Agnostic Framework for Features at Any Resolution

http://arxiv.org/abs/2403.10516v1

Compressor summary: FeatUp is a framework to restore spatial resolution in deep features for computer vision tasks such as segmentation and depth prediction.


A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction

http://arxiv.org/abs/2403.10511v1

Compressor summary: The paper introduces a novel framework that jointly predicts the gaze target and social gaze label for multiple people in a scene using a temporal, transformer-based architecture and a new dataset.


Belief Change based on Knowledge Measures

http://arxiv.org/abs/2403.10502v1

Compressor summary: The text proposes a new quantitative framework for belief change based on knowledge measures that minimizes surprise and satisfies AGM postulates, introducing information measures for contraction, expansion, and revision operations.


Approximate Nullspace Augmented Finetuning for Robust Vision Transformers

http://arxiv.org/abs/2403.10476v1

Compressor summary: The authors propose a method to improve vision transformers' robustness by adding nullspace noise to the input during fine-tuning.


Understanding the Double Descent Phenomenon in Deep Learning

http://arxiv.org/abs/2403.10459v1

Compressor summary: Double descent is when increasing model complexity past the interpolation point lowers test error due to inductive biases selecting smooth empirical risk minimizers.


Robust Shape Fitting for 3D Scene Abstraction

http://arxiv.org/abs/2403.10452v1

Compressor summary: The text introduces a novel method to infer simple geometric shapes (cuboids) from complex real-world scenes using depth maps and neural networks, without requiring manual annotations.


Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

http://arxiv.org/abs/2403.10446v1

Compressor summary: The authors propose a system that uses Retrieval Augmented Generation to improve the accuracy of Large Language Models on domain-specific and time-sensitive queries, and find that fine-tuning on small, skewed datasets has limitations.


Optimal Block-Level Draft Verification for Accelerating Speculative Decoding

http://arxiv.org/abs/2403.10444v1

Compressor summary: The paper proposes a block-level verification algorithm for speculative decoding that improves wall-clock speedup by considering a wider range of draft verification algorithms and accepting more tokens in expectation.


Using an LLM to Turn Sign Spottings into Spoken Language Sentences

http://arxiv.org/abs/2403.10434v1

Compressor summary: Spotter+GPT combines a sign spotter with a large language model to generate spoken sentences from sign language videos.


SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians

http://arxiv.org/abs/2403.10427v1

Compressor summary: The paper proposes a method to improve 3D Gaussian Splatting for learning 3D scenes from unstructured in-the-wild photos by modeling appearance and handling occluders.


NeuFlow: Real-time, High-accuracy Optical Flow Estimation on Robots Using Edge Devices

http://arxiv.org/abs/2403.10425v1

Compressor summary: The paper proposes NeuFlow, an efficient learning-based optical flow method that achieves high accuracy and speedup compared to state-of-the-art methods, enabling real-time computer vision tasks on edge computing platforms like drones.


Structured Evaluation of Synthetic Tabular Data

http://arxiv.org/abs/2403.10424v1

Compressor summary: The authors propose an evaluation framework for synthetic tabular data quality based on a single objective that the synthetic data should match the observed data distribution, and show that structured synthesizers perform better than others.


Robust Sparse Estimation for Gaussians with Optimal Error under Huber Contamination

http://arxiv.org/abs/2403.10416v1

Compressor summary: The paper proposes efficient and optimal robust estimators for Gaussian sparse estimation tasks, such as mean estimation, PCA, and linear regression, using a new multidimensional filtering method.


Gradient based Feature Attribution in Explainable AI: A Technical Review

http://arxiv.org/abs/2403.10415v1

Compressor summary: The text discusses the need for explainable AI, especially for neural networks, and presents a taxonomy of gradient based explanation methods along with challenges and evaluations.


Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search

http://arxiv.org/abs/2403.10413v1

Compressor summary: The paper proposes a method to efficiently integrate multi-head self-attention into high resolution representation CNNs for image segmentation using architecture search, achieving better efficiency and effectiveness than previous methods.


A comparative study on machine learning approaches for rock mass classification using drilling data

http://arxiv.org/abs/2403.10404v1

Compressor summary: The study develops models to use Measure While Drilling data for accurate and automated rock mass quality classification, supporting decision making in tunnel engineering.


Energy Correction Model in the Feature Space for Out-of-Distribution Detection

http://arxiv.org/abs/2403.10403v1

Compressor summary: The paper proposes an EBM-based method for OOD detection that outperforms KNN on CIFAR-10/CIFAR-100 benchmarks.


Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

http://arxiv.org/abs/2403.10395v1

Compressor summary: Isotropic3D is an image-to-3D generation pipeline that uses a CLIP embedding and a text-to-3D diffusion model fine-tuned with Explicit Multi-view Attention to generate consistent, symmetrical, and less distorted 3D models.


CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning

http://arxiv.org/abs/2403.10391v1

Compressor summary: CDMAD is a new SSL algorithm that addresses class imbalance and distribution mismatch by refining biased pseudo-labels and ensuring a neutral classifier.


Evaluating Perceptual Distances by Fitting Binomial Distributions to Two-Alternative Forced Choice Data

http://arxiv.org/abs/2403.10390v1

Compressor summary: The paper proposes a new method for evaluating perceptual distances using statistical modeling of binomial distributions fitted to human judgments in two-alternative forced choice experiments, improving on previous binary decision approaches.


Monotonic Representation of Numeric Properties in Language Models

http://arxiv.org/abs/2403.10381v1

Compressor summary: The paper introduces a method to find and edit numeric property representations in language models, showing how changing these representations can alter the model's output.


Regret Minimization via Saddle Point Optimization

http://arxiv.org/abs/2403.10379v1

Compressor summary: The paper presents an anytime variant of the Estimation-To-Decisions algorithm that optimizes exploration-exploitation trade-off online for sequential decision-making in structured bandits and reinforcement learning.


EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models

http://arxiv.org/abs/2403.10378v1

Compressor summary: EXAMS-V is a new multilingual exam benchmark that tests vision language models' abilities to reason over text and images from diverse disciplines and regions.


PASTA: Towards Flexible and Efficient HDR Imaging Via Progressively Aggregated Spatio-Temporal Aligment

http://arxiv.org/abs/2403.10376v1

Compressor summary: PASTA is a novel framework for HDR deghosting that uses hierarchical representation and feature disentanglement to achieve both effectiveness and efficiency, with a significant 3x speedup compared to current methods.


Towards a general framework for improving the performance of classifiers using XAI methods

http://arxiv.org/abs/2403.10373v1

Compressor summary: The paper proposes a framework to use XAI techniques to improve pre-trained DL classifiers without retraining them, using either auto-encoder or encoder-decoder based learning strategies.


An Energy-Efficient Ensemble Approach for Mitigating Data Incompleteness in IoT Applications

http://arxiv.org/abs/2403.10371v1

Compressor summary: This paper proposes a technique called ENAMLE that reduces energy consumption and improves performance in IoT-based machine learning systems facing data incompleteness due to missing sensor readings.


Open Stamped Parts Dataset

http://arxiv.org/abs/2403.10369v1

Compressor summary: The Open Stamped Parts Dataset (OSPD) contains real and synthetic images of metal sheets with annotations for defect detection, which can help improve automotive manufacturing and computer vision.


Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages

http://arxiv.org/abs/2403.10367v1

Compressor summary: The paper evaluates MediaPipe Holistic and OpenFace for landmark tracking of sign languages and finds that both methods need further improvement for linguistic analysis of eyebrow movements.


ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image

http://arxiv.org/abs/2403.10357v1

Compressor summary: ANIM is a novel method that accurately reconstructs 3D human shapes from single-view RGB-D images by incorporating depth observations and leveraging multi-resolution features to overcome depth ambiguities.


SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

http://arxiv.org/abs/2403.10353v1

Compressor summary: The paper presents SimPB, a single model that simultaneously detects 2D and 3D objects from multiple cameras using a hybrid decoder and dynamic query allocation modules.


TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale

http://arxiv.org/abs/2403.10351v1

Compressor summary: TriSum is a framework that distills large language models' text summarization abilities into a compact, local model, improving performance and interpretability on various benchmarks.


ParaPoint: Learning Global Free-Boundary Surface Parameterization of 3D Point Clouds

http://arxiv.org/abs/2403.10349v1

Compressor summary: ParaPoint is an unsupervised neural network that parameterizes 3D point clouds with UV coordinates, using sub-networks with specific functionalities, a bi-directional cycle mapping framework, and differential geometric constraints in what the authors describe as the first attempt at global mappings with free boundaries.


Denoising Task Difficulty-based Curriculum for Training Diffusion Models

http://arxiv.org/abs/2403.10348v1

Compressor summary: The paper investigates the difficulty of denoising tasks in diffusion-based generative models across different timesteps and proposes an easy-to-hard learning scheme to improve performance and convergence.


SCILLA: SurfaCe Implicit Learning for Large Urban Area, a volumetric hybrid solution

http://arxiv.org/abs/2403.10344v1

Compressor summary: SCILLA is a new method for reconstructing large outdoor scenes from images using two implicit fields, one for density and another for distance to the surface, with a novel volume-rendering strategy that works without geometric priors.


Thermal-NeRF: Neural Radiance Fields from an Infrared Camera

http://arxiv.org/abs/2403.10340v1

Compressor summary: Thermal-NeRF is a novel method that uses infrared cameras to reconstruct 3D scenes with high detail under poor lighting conditions, outperforming existing methods.


Generation is better than Modification: Combating High Class Homophily Variance in Graph Anomaly Detection

http://arxiv.org/abs/2403.10339v1

Compressor summary: HedGe is a new GNN model that generates low-variance edges to improve anomaly detection and robustness in graphs.


Investigating grammatical abstraction in language models using few-shot learning of novel noun gender

http://arxiv.org/abs/2403.10338v1

Compressor summary: The text describes an experiment comparing an LSTM and a transformer model's ability to learn and generalize grammatical gender in French, similar to human learning, and finds that both models show a masculine gender bias.


How Powerful Potential of Attention on Image Restoration?

http://arxiv.org/abs/2403.10336v1

Compressor summary: The paper proposes Continuous Scaling Attention (CSAttn), a new attention mechanism for image restoration without using feed-forward networks, and demonstrates its effectiveness on various tasks.


NECA: Neural Customizable Human Avatar

http://arxiv.org/abs/2403.10335v1

Compressor summary: NECA learns a versatile human representation from videos by predicting disentangled neural fields for geometry, albedo, shadow, and lighting, enabling realistic rendering and editing of an avatar's appearance and illumination.


Towards Non-Adversarial Algorithmic Recourse

http://arxiv.org/abs/2403.10330v1

Compressor summary: The paper explores the differences between adversarial examples and counterfactual explanations, and argues for obtaining non-adversarial algorithmic recourse in high-stakes situations.


CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model

http://arxiv.org/abs/2403.10326v1

Compressor summary: This paper proposes a method to generate cloze distractors using pre-trained language models, improving learner ability assessment effectiveness.


Anytime Neural Architecture Search on Tabular Data

http://arxiv.org/abs/2403.10318v1

Compressor summary: ATLAS is an efficient anytime Neural Architecture Search approach for tabular data that uses a two-phase filtering-and-refinement optimization scheme and reduces search time by up to 82.75x.


KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using Wikidata

http://arxiv.org/abs/2403.10304v1

Compressor summary: KIF is a framework that integrates different knowledge bases using Wikidata as a common language and provides a unified view and querying options.


Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

http://arxiv.org/abs/2403.10301v1

Compressor summary: Uni-SMART is a multimodal model that improves the analysis of scientific literature by understanding its various elements, such as molecular structure, tables, and charts, and outperforms existing text-focused models in several domains.


A Multi-constraint and Multi-objective Allocation Model for Emergency Rescue in IoT Environment

http://arxiv.org/abs/2403.10299v1

Compressor summary: The paper presents MSGW-FLM, a new resource allocation model for emergency relief operations that uses IoT and spatio-temporal data analytics to optimize decision-making in complex disaster scenarios.


Context-Semantic Quality Awareness Network for Fine-Grained Visual Categorization

http://arxiv.org/abs/2403.10298v1

Compressor summary: The paper proposes a novel network (CSQA-Net) for fine-grained visual categorization that uses cross-attention and semantic quality evaluation to improve feature representations and discriminability.


Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression

http://arxiv.org/abs/2403.10297v1

Compressor summary: The paper proposes a method to improve keypoint scene coordinate regression (KSCR) by synthesizing novel keypoint descriptors using Neural Radiance Fields, enhancing localization accuracy and generalization in data-scarce environments.


MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

http://arxiv.org/abs/2403.10293v1

Compressor summary: The MaiBaam corpus is the first multi-dialect Bavarian treebank with UD annotation, covering various text genres and illustrating morphosyntactic differences between Bavarian and German.


Deep Learning for Multi-Level Detection and Localization of Myocardial Scars Based on Regional Strain Validated on Virtual Patients

http://arxiv.org/abs/2403.10291v1

Compressor summary: The paper presents a CNN-based framework that converts the clinical-standard bullseye representation of regional strain into a multi-channel 2D image to detect and localize myocardial disease substrates such as scar, achieving high accuracy on simulated data.


Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models

http://arxiv.org/abs/2403.10287v1

Compressor summary: VISE is a training-free method that converts few-shot image classification and segmentation into VQA using VLMs and off-the-shelf vision models, achieving state-of-the-art results.


Local positional graphs and attentive local features for a data and runtime-efficient hierarchical place recognition pipeline

http://arxiv.org/abs/2403.10283v1

Compressor summary: The paper proposes a hierarchical Visual Place Recognition pipeline that uses training-free and data-efficient local feature encoding, attention module, and hyperdimensional computing to achieve better performance, speed, and storage compared to existing methods.


Team Trifecta at Factify5WQA: Setting the Standard in Fact Verification with Fine-Tuning

http://arxiv.org/abs/2403.10281v1

Compressor summary: Pre-CoFactv3 is a framework that uses In-Context Learning, Fine-tuned LLMs, and FakeNet to improve fact verification accuracy and won first place in the AAAI-24 Factify 3.0 Workshop.


A Question on the Explainability of Large Language Models and the Word-Level Univariate First-Order Plausibility Assumption

http://arxiv.org/abs/2403.10275v1

Compressor summary: The paper proposes a way to measure the quality of explanations from large language models and shows that simpler models have clearer explanations than transformers.


Towards Generalizable Deepfake Video Detection with Thumbnail Layout and Graph Reasoning

http://arxiv.org/abs/2403.10261v1

Compressor summary: This paper proposes a simple and effective method called Thumbnail Layout (TALL) for deepfake video detection that transforms clips into pre-defined layouts, and enhances it with a graph reasoning block and semantic consistency loss to improve performance.


Comprehensive Study Of Predictive Maintenance In Industries Using Classification Models And LSTM Model

http://arxiv.org/abs/2403.10259v1

Compressor summary: The study compares various machine learning algorithms to determine the best one for predicting and analyzing machine performance in maintenance applications.


Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

http://arxiv.org/abs/2403.10258v1

Compressor summary: The authors show that English-centric large language models may not perform well on culture-related tasks when prompted in English, and suggest developing stronger multilingual models instead.


Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder

http://arxiv.org/abs/2403.10255v1

Compressor summary: The paper proposes a new method for super-resolution and image generation at arbitrary scales using a latent diffusion model, an auto-encoder, and an implicit neural decoder, which improves quality, diversity, and consistency while reducing memory and inference time.


Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

http://arxiv.org/abs/2403.10254v1

Compressor summary: The paper introduces a new framework called EDITOR that uses tokenized features from different modalities and adaptive selection to improve multi-modal object re-identification robustness and discrimination.


Open Continual Feature Selection via Granular-Ball Knowledge Transfer

http://arxiv.org/abs/2403.10253v1

Compressor summary: The paper proposes a novel framework that combines continual learning and granular-ball computing for feature selection in data preprocessing, addressing unknown classes and knowledge transfer challenges.


Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning

http://arxiv.org/abs/2403.10252v1

Compressor summary: The study proposes a novel method for partially supervised multi-task dense prediction by aligning region-wise Gaussian distributions to capture cross-task relationships, achieving state-of-the-art results on two benchmarks.


A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges

http://arxiv.org/abs/2403.10249v1

Compressor summary: This paper reviews LM-based agents for games, discusses their challenges, and suggests future research directions.


CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning

http://arxiv.org/abs/2403.10245v1

Compressor summary: The paper presents CoLeCLIP, a novel method for open-domain continual learning of vision-language models that uses task prompts and a cross-domain class vocabulary, improving performance on 11 domain datasets.


FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model

http://arxiv.org/abs/2403.10242v1

Compressor summary: The paper presents FDGaussian, a novel framework for single-image 3D reconstruction that uses orthogonal plane decomposition and epipolar attention to generate consistent multi-view images and high-quality 3D objects.


A Big Data Approach to Understand Sub-national Determinants of FDI in Africa

http://arxiv.org/abs/2403.10239v1

Compressor summary: The paper presents a new method to analyze news articles and social networks to understand how regional factors influence FDI in African companies.


A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream

http://arxiv.org/abs/2403.10237v1

Compressor summary: The study aims to improve topic detection in Persian language by adapting existing methods and comparing their performance on social network texts using a new evaluation criterion.


A Fixed-Point Approach to Unified Prompt-Based Counting

http://arxiv.org/abs/2403.10236v1

Compressor summary: The paper proposes a method for generating density maps for object counting using various prompt types, improving accuracy with fixed-point inference and contrastive training.


Less is More: One-shot Subgraph Reasoning on Large-scale Knowledge Graphs

http://arxiv.org/abs/2403.10231v1

Compressor summary: One-shot-subgraph link prediction efficiently predicts links in knowledge graphs by extracting a query-dependent subgraph and using Personalized PageRank to identify potential answers and evidence.
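As a rough illustration of the Personalized PageRank step mentioned in this summary, here is a minimal power-iteration sketch in Python. The toy graph, the function names `personalized_pagerank` and `top_k_subgraph`, and the `alpha`, `iters`, and `k` settings are all assumptions for illustration; the summary does not describe the paper's actual sampling procedure, so this should not be read as the authors' implementation.

```python
from typing import Dict, List


def personalized_pagerank(
    graph: Dict[str, List[str]], seed: str, alpha: float = 0.85, iters: int = 50
) -> Dict[str, float]:
    """Power iteration for PageRank with all restart mass on `seed`.

    `graph` maps each node to its out-neighbours; every neighbour must
    also appear as a key. `alpha` is the probability of following an edge.
    """
    scores = {n: 1.0 if n == seed else 0.0 for n in graph}
    for _ in range(iters):
        # Restart mass (1 - alpha) always returns to the seed (query) node.
        nxt = {n: (1.0 - alpha) if n == seed else 0.0 for n in graph}
        for n, nbrs in graph.items():
            if not nbrs:
                # Dangling node: send its mass back to the seed.
                nxt[seed] += alpha * scores[n]
                continue
            share = alpha * scores[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        scores = nxt
    return scores


def top_k_subgraph(
    graph: Dict[str, List[str]], seed: str, k: int
) -> Dict[str, List[str]]:
    """Keep the k highest-scoring nodes and the edges among them."""
    scores = personalized_pagerank(graph, seed)
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return {n: [m for m in graph[n] if m in keep] for n in graph if n in keep}


# Hypothetical toy graph: "d" points into the graph but is unreachable from "a".
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
subgraph = top_k_subgraph(toy_graph, seed="a", k=3)
```

Nodes that cannot be reached from the query node receive zero score, which is why a score-based cutoff yields a query-dependent subgraph.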


HawkEye: Training Video-Text LLMs for Grounding Text in Videos

http://arxiv.org/abs/2403.10228v1

Compressor summary: HawkEye is a new video-text language model that can answer questions about long videos by using temporal information, thanks to a large-scale dataset and two new training objectives.


From Chaos to Clarity: Time Series Anomaly Detection in Astronomical Observations

http://arxiv.org/abs/2403.10220v1

Compressor summary: AERO is a novel framework for anomaly detection in astronomical observations that uses a Transformer encoder-decoder and a window-wise graph neural network to distinguish normal patterns from concurrent noise, reducing false alarms.


Exploring Optical Flow Inclusion into nnU-Net Framework for Surgical Instrument Segmentation

http://arxiv.org/abs/2403.10216v1

Compressor summary: The study proposes using optical flow maps as an additional input to the nnU-Net framework for improved surgical instrument segmentation in laparoscopy by leveraging the temporal information of moving objects.


Enhanced Coherence-Aware Network with Hierarchical Disentanglement for Aspect-Category Sentiment Analysis

http://arxiv.org/abs/2403.10214v1

Compressor summary: ECAN is a novel model for aspect-category-based sentiment analysis that uses coherence modeling and hierarchical disentanglement to identify and extract multiple aspect categories and their sentiments from reviews.


BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution

http://arxiv.org/abs/2403.10211v1

Compressor summary: BlindDiff is a diffusion model-based method for blind image super-resolution that integrates MAP optimization, modulated conditional transformer, and kernel-aware gradient to handle complex unknown degradations and achieve state-of-the-art performance.


Read between the lines -- Functionality Extraction From READMEs

http://arxiv.org/abs/2403.10205v1

Compressor summary: The paper proposes functionality extraction from Git README files, a challenging text generation task, and introduces FuncRead dataset and models that outperform existing large language models for this task.


Generative Region-Language Pretraining for Open-Ended Object Detection

http://arxiv.org/abs/2403.10191v1

Compressor summary: GenerateU is a framework for generative open-ended object detection that can detect and name objects without predefined categories, using Deformable DETR and a language model.


Perceptual Quality-based Model Training under Annotator Label Uncertainty

http://arxiv.org/abs/2403.10190v1

Compressor summary: The text discusses annotator label uncertainty, its negative effects on model performance, and proposes a new method to generate multiple labels for training models without requiring massive annotations.


Can Factual Statements be Deceptive? The DeFaBel Corpus of Belief-based Deception

http://arxiv.org/abs/2403.10185v1

Compressor summary: The DeFaBel corpus explores the relation between deception and factuality based on personal belief, and shows that people are more confident in arguments aligned with their beliefs.


Lifted Causal Inference in Relational Domains

http://arxiv.org/abs/2403.10184v1

Compressor summary: Lifted inference speeds up causal inference in relational domains by using symmetries and representative objects.


Reliable uncertainty with cheaper neural network ensembles: a case study in industrial parts classification

http://arxiv.org/abs/2403.10182v1

Compressor summary: This study compares different neural network ensembles for uncertainty estimation in operations research, finding the batch ensemble to be cost-effective and competitive while reducing training and test time and memory usage.


Animate Your Motion: Turning Still Images into Dynamic Videos

http://arxiv.org/abs/2403.10179v1

Compressor summary: The SMCD method combines semantic and motion cues in a diffusion model to improve text-to-video generation by enhancing control over video outputs.


A Short Survey on Importance Weighting for Machine Learning

http://arxiv.org/abs/2403.10175v1

Compressor summary: Importance weighting is a simple and useful technique in statistics and machine learning that adjusts the objective function or probability distribution based on the importance of instances, with many applications such as handling distribution shift.
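To make the idea concrete, the following is a small Python sketch of importance weighting under distribution shift, not code from the survey. It assumes a discrete variable with known source and target probabilities (hypothetical numbers); in practice the density ratio usually has to be estimated. Each source sample is weighted by `target_prob / source_prob`, so an average over source data estimates the expectation under the target distribution.

```python
from typing import Callable, Dict, List, Sequence


def importance_weights(
    samples: Sequence[int],
    source_prob: Dict[int, float],
    target_prob: Dict[int, float],
) -> List[float]:
    """Density ratio p_target(x) / p_source(x) for each sample."""
    return [target_prob[x] / source_prob[x] for x in samples]


def weighted_mean(
    samples: Sequence[int], weights: Sequence[float], f: Callable[[int], float]
) -> float:
    """Importance-weighted estimate of E_target[f(x)] from source samples."""
    return sum(w * f(x) for x, w in zip(samples, weights)) / len(samples)


# Hypothetical distributions over the values 0, 1, 2.
source_prob = {0: 0.5, 1: 0.3, 2: 0.2}
target_prob = {0: 0.2, 1: 0.3, 2: 0.5}

# Ten samples whose frequencies match the source distribution exactly.
samples = [0] * 5 + [1] * 3 + [2] * 2

weights = importance_weights(samples, source_prob, target_prob)
naive = sum(samples) / len(samples)  # mean under the source distribution
corrected = weighted_mean(samples, weights, lambda x: float(x))
```

Here the unweighted mean reflects the source distribution, while the weighted mean recovers the target expectation 0(0.2) + 1(0.3) + 2(0.5) = 1.3.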


A Hybrid SNN-ANN Network for Event-based Object Detection with Spatial and Temporal Attention

http://arxiv.org/abs/2403.10173v1

Compressor summary: The authors propose a novel hybrid attention-based spiking neural network (SNN) and artificial neural network (ANN) architecture for object detection using event cameras, achieving ANN-like performance with reduced parameters and low latency on neuromorphic hardware.


AUTONODE: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

http://arxiv.org/abs/2403.10171v1

Compressor summary: AUTONODE is an AI system that uses neuro-graphical techniques and learning from experience to autonomously navigate and execute complex tasks on web interfaces without predefined scripts or manual intervention.


Computer User Interface Understanding. A New Dataset and a Learning Framework

http://arxiv.org/abs/2403.10170v1

Compressor summary: The paper introduces computer UI understanding, a challenging task, and presents a dataset of videos showing user actions on desktop interfaces and a framework using synthetic samples and contrastive learning for fine-grained UI classification.


Explainability through uncertainty: Trustworthy decision-making with neural networks

http://arxiv.org/abs/2403.10168v1

Compressor summary: The paper proposes a general uncertainty framework for neural networks that links uncertainty estimation to explainable AI, reduces misclassifications with human expert input, and improves trustworthiness under distribution shifts.


Efficient Detection of Exchangeable Factors in Factor Graphs

http://arxiv.org/abs/2403.10167v1

Compressor summary: The paper introduces a new algorithm, DEFT, which efficiently finds exchangeable factors in a factor graph, enabling faster probabilistic inference with lifted models.


SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation

http://arxiv.org/abs/2403.10166v1

Compressor summary: SemanticHuman-HD is a novel method that achieves semantic disentangled human image synthesis at high resolution using 3D-aware super-resolution at reduced computational cost.


CoReEcho: Continuous Representation Learning for 2D+time Echocardiography Analysis

http://arxiv.org/abs/2403.10164v1

Compressor summary: CoReEcho is a novel training framework for echocardiogram analysis that improves performance, explainability, and generalization compared to existing deep learning models.


Online Policy Learning from Offline Preferences

http://arxiv.org/abs/2403.10160v1

Compressor summary: The paper proposes a framework that combines offline and virtual preferences for learning reward functions in preference-based reinforcement learning, improving generalizability and guidance for agents.


Functional Graph Convolutional Networks: A unified multi-task and multi-modal learning framework to facilitate health and social-care insights

http://arxiv.org/abs/2403.10158v1

Compressor summary: The paper presents a novel framework called funGCN that combines Functional Data Analysis and Graph Convolutional Networks to handle multi-task and multi-modal learning in digital health and longitudinal studies, ensuring interpretability and performance.


Improving Medical Multi-modal Contrastive Learning with Expert Annotations

http://arxiv.org/abs/2403.10153v1

Compressor summary: eCLIP is an improved CLIP model that uses radiologist eye-gaze heatmaps as expert annotations to enhance multi-modal medical imaging analysis, particularly addressing data scarcity and modality gap issues.


GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time

http://arxiv.org/abs/2403.10147v1

Compressor summary: GGRt is a novel framework that enables fast and efficient generalizable novel view synthesis without real camera poses using 3D Gaussian Splatting and joint learning.


RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception

http://arxiv.org/abs/2403.10145v1

Compressor summary: Key points:
- Roadside perception is important for autonomous driving and traffic management
- Existing approaches have limitations in sensing range and blind spots
- RCooper aims to achieve area-coverage roadside perception for restricted traffic areas
- The paper releases the first large-scale RCooper dataset with annotations
Summary: The paper introduces RCooper, the first large-scale annotated real-world dataset for area-coverage roadside cooperative perception in autonomous driving and traffic management.


NLP Verification: Towards a General Methodology for Certifying Robustness

http://arxiv.org/abs/2403.10144v1

Compressor summary: The paper proposes a methodology to evaluate and improve the safety and reliability of deep neural networks in natural language processing by addressing technical challenges and introducing new metrics for verification pipelines.


E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

http://arxiv.org/abs/2403.10133v1

Compressor summary: The paper proposes a zero-shot image editing method, E4C, which enhances editability and alignment with CLIP guidance using a dual-branch feature-sharing pipeline.


RAFT: Adapting Language Model to Domain Specific RAG

http://arxiv.org/abs/2403.10131v1

Compressor summary: The paper introduces RAFT, a method that trains large language models to answer domain-specific questions by ignoring distractor documents and citing the relevant sequence from the documents that help answer the question.


TransLandSeg: A Transfer Learning Approach for Landslide Semantic Segmentation Based on Vision Foundation Model

http://arxiv.org/abs/2403.10127v1

Compressor summary: TransLandSeg is a transfer learning approach for landslide detection that improves efficiency and accuracy by adapting the powerful segmentation capability of SAM with only 1.3% of its parameters.


Depth-induced Saliency Comparison Network for Diagnosis of Alzheimer's Disease via Jointly Analysis of Visual Stimuli and Eye Movements

http://arxiv.org/abs/2403.10124v1

Compressor summary: DISCN is a network that uses eye movement analysis with salient attention and serial attention modules to detect Alzheimer's disease from visual stimuli.


Regularization-Based Efficient Continual Learning in Deep State-Space Models

http://arxiv.org/abs/2403.10123v1

Compressor summary: CLDSSMs are a novel approach that combines deep state-space models with continual learning methods to adapt to evolving tasks without forgetting previous ones.


URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields

http://arxiv.org/abs/2403.10119v1

Compressor summary: The proposed method improves NeRF performance by estimating camera poses and velocities from unordered rolling shutter images without sequential data constraints.


Single- and Multi-Agent Private Active Sensing: A Deep Neuroevolution Approach

http://arxiv.org/abs/2403.10112v1

Compressor summary: The paper proposes NeuroEvolution-based methods for centralized and decentralized active hypothesis testing with eavesdroppers, showing improved performance in anomaly detection over wireless sensor networks.


Meta Operator for Complex Query Answering on Knowledge Graphs

http://arxiv.org/abs/2403.10110v1

Compressor summary: The paper proposes a meta-learning algorithm that improves complex query answering on knowledge graphs by learning meta-operators, which generalize better than existing approaches.


Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

http://arxiv.org/abs/2403.10107v1

Compressor summary: The paper proposes a framework that uses multiple large language models to improve video-based human-object interaction detection and enhance the decision-making of robots and autonomous systems.


CSDNet: Detect Salient Object in Depth-Thermal via A Lightweight Cross Shallow and Deep Perception Network

http://arxiv.org/abs/2403.10104v1

Compressor summary: CSDNet is a lightweight network that uses spatial information prescreening and implicit coherence navigation to integrate two less-coherent modalities for salient object detection in robotic perception, outperforming methods using RGB-T or RGB-D modalities with faster runtime and fewer FLOPs.


DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video

http://arxiv.org/abs/2403.10103v1

Compressor summary: DyBluRF is a dynamic NeRF method that synthesizes sharp novel views from motion-blurred monocular videos by capturing camera and object trajectories and using cross-time rendering.


KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

http://arxiv.org/abs/2403.10099v1

Compressor summary: KP-RED is a method to match object scans with CAD models using sparse keypoints, embedding space, and neural cage deformation.


DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration

http://arxiv.org/abs/2403.10098v1

Compressor summary: The paper proposes a Diffusion-Information-Diffusion framework to correct face degradation across diverse scenarios using AdaIN and manifold information bottleneck, achieving high generalization and quality.


Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks

http://arxiv.org/abs/2403.10097v1

Compressor summary: AdaRand is a simple method for improving fine-tuning of deep neural networks without auxiliary source information by adapting feature vector distributions using class conditional Gaussian distributions.


RangeLDM: Fast Realistic LiDAR Point Cloud Generation

http://arxiv.org/abs/2403.10094v1

Compressor summary: RangeLDM is a novel approach that uses latent diffusion models to rapidly generate high-quality range-view LiDAR point clouds for autonomous driving with accurate projection, compression, and 3D structural fidelity preservation.


Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIF

http://arxiv.org/abs/2403.10088v1

Compressor summary: CoARL is a new framework that generates effective counterspeech for online hate speech by modeling social biases and using multi-instruction tuning and reinforcement learning.


Monkeypox disease recognition model based on improved SE-InceptionV3

http://arxiv.org/abs/2403.10087v1

Compressor summary: The study presents an improved SE-InceptionV3 model that detects monkeypox with 96.71% accuracy using SENet and L2 regularization, outperforming conventional and deep learning methods.


VRHCF: Cross-Source Point Cloud Registration via Voxel Representation and Hierarchical Correspondence Filtering

http://arxiv.org/abs/2403.10085v1

Compressor summary: Key points:
- A novel framework for point cloud registration with broad applicability
- Spherical voxel feature representation to handle different densities and distributions
- Hierarchical correspondence filtering to remove outliers and mismatches
- High performance in both homologous and cross-source registration scenarios
- Code available at https://github.com/GuiyuZhao/VRHCF
Summary: The authors present a versatile point cloud registration framework that uses spherical voxels and hierarchical correspondence filtering to achieve high performance in both homologous and cross-source registration scenarios, with code available online.


CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

http://arxiv.org/abs/2403.10082v1

Compressor summary: Key points:
- Proposed method uses text descriptions from LLMs to guide feature learning for one-shot action recognition
- Text descriptions are used in a global-local-global way to focus on informative joints and form global representation
- Dual-branch architecture allows inference without text input and reduces cost compared to base encoder
- Outperforms existing methods and can enhance other skeleton encoders
Summary: The paper proposes a method that uses text descriptions from LLMs to guide feature learning for one-shot action recognition, achieving better performance and efficiency than existing methods.


DRAGIN: Dynamic Retrieval Augmented Generation based on the Real-time Information Needs of Large Language Models

http://arxiv.org/abs/2403.10081v1

Compressor summary: DRAGIN is a new framework that improves text generation by dynamically deciding when and what to retrieve based on the real-time information needs of large language models.


Learning Physical Dynamics for Object-centric Visual Prediction

http://arxiv.org/abs/2403.10079v1

Compressor summary: The paper proposes an unsupervised object-centric prediction model that learns visual dynamics between objects for future predictions, outperforming existing methods in visual quality and physical reliability.


A survey of synthetic data augmentation methods in computer vision

http://arxiv.org/abs/2403.10075v1

Compressor summary: This paper reviews various techniques for creating synthetic data to augment computer vision tasks when real data is scarce or unavailable, covering approaches based on 3D graphics, neural style transfer, differential neural rendering, GANs, and VAEs.


Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

http://arxiv.org/abs/2403.10071v1

Compressor summary: VQCT is a new framework that uses language model-pretrained codebooks and part-of-speech knowledge to improve image synthesis with vector-quantized image modeling.


Boundary Matters: A Bi-Level Active Finetuning Framework

http://arxiv.org/abs/2403.10069v1

Compressor summary: The paper proposes a new active finetuning framework that selects samples for annotation based on diversity and uncertainty using pseudo-class centers, denoising, and iterative boundary sample selection in high-dimensional feature space.


What Makes Good Collaborative Views? Contrastive Mutual Information Maximization for Multi-Agent Perception

http://arxiv.org/abs/2403.10068v1

Compressor summary: The paper proposes a new framework for multi-agent perception that improves collaboration, preserves individual view information, and reduces communication volume.


Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment

http://arxiv.org/abs/2403.10066v1

Compressor summary: CoPA is a novel contrastive pre-training framework for point cloud quality assessment that learns quality-aware representations from unlabeled data using mixed images with multiple distortions, improving generalization and performance over existing methods.


Triple GNNs: Introducing Syntactic and Semantic Information for Conversational Aspect-Based Quadruple Sentiment Analysis

http://arxiv.org/abs/2403.10065v1

Compressor summary: The Triple GNNs network improves dialogue sentiment analysis by using graph convolution and attention to capture syntactic dependencies within utterances and inter-utterance interactions.


Unified Projection-Free Algorithms for Adversarial DR-Submodular Optimization

http://arxiv.org/abs/2403.10063v1

Compressor summary: The paper presents new projection-free algorithms for continuous DR-submodular optimization with various scenarios and constraints, achieving sub-linear regret bounds in both non-monotone and monotone settings.


PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment

http://arxiv.org/abs/2403.10061v1

Compressor summary: The paper proposes a self-supervised pre-training framework using masked autoencoders to improve no-reference point cloud quality assessment without labeled data.


RID-TWIN: An end-to-end pipeline for automatic face de-identification in videos

http://arxiv.org/abs/2403.10058v1

Compressor summary: RID-TWIN is a new method that uses advanced generative models to de-identify faces in videos by separating identity from motion, while addressing various challenges in the field.


Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning

http://arxiv.org/abs/2403.10056v1

Compressor summary: The paper proposes a novel continual instruction tuning method for large language models using Key-part Information Gain (KPIG) to alleviate catastrophic forgetting and improve task-aware information capture.


Control and Automation for Industrial Production Storage Zone: Generation of Optimal Route Using Image Processing

http://arxiv.org/abs/2403.10054v1

Compressor summary: The article presents an industrial automation method using digital image processing and optimization techniques for generating optimal routes in a warehouse area.


Group-Mix SAM: Lightweight Solution for Industrial Assembly Line Applications

http://arxiv.org/abs/2403.10053v1

Compressor summary: The authors propose Group-Mix SAM, a lightweight version of MobileSAM, which can be deployed in practical assembly line scenarios due to its reduced size and computational cost, while maintaining similar performance.


T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory

http://arxiv.org/abs/2403.10052v1

Compressor summary: The paper proposes a masked autoencoder (MAE) for trajectory prediction that learns from actor-specific token memory and adapts to distribution shifts, improving accuracy and efficiency over existing methods.


Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

http://arxiv.org/abs/2403.10050v1

Compressor summary: Texture-GS is a novel approach to disentangle appearance and geometry in 3D Gaussian splatting, enabling high-fidelity appearance editing and real-time rendering.


TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

http://arxiv.org/abs/2403.10047v1

Compressor summary: The proposed scene text spotter uses a pre-trained language model and text block detection to recognize text in images without precise detection, achieving better performance in complex scenarios.


SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

http://arxiv.org/abs/2403.10044v1

Compressor summary: SphereDiffusion is a novel framework for generating high-quality, controllable spherical panoramic images by addressing unique challenges such as spherical distortion and geometry characteristics using text encoding, deformable techniques, and improved data diversity.


Rethinking Low-quality Optical Flow in Unsupervised Surgical Instrument Segmentation

http://arxiv.org/abs/2403.10039v1

Compressor summary: The authors propose a method to improve unsupervised surgical instrument segmentation by enhancing optical flow quality and reducing manual annotations.


Knowledge Condensation and Reasoning for Knowledge-based VQA

http://arxiv.org/abs/2403.10037v1

Compressor summary: The paper proposes two models that condense and reason over external knowledge to improve knowledge-based visual question answering, achieving state-of-the-art results.


SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception

http://arxiv.org/abs/2403.10036v1

Compressor summary: SparseFusion is a novel framework that uses sparse 3D features to enable efficient long-range 3D object detection, outperforming dense detectors and achieving state-of-the-art results on several tasks.


Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

http://arxiv.org/abs/2403.10030v1

Compressor summary: MCTF is a method to improve the efficiency and accuracy of Vision Transformers by fusing tokens based on multiple criteria and using one-step-ahead attention.


Lifelong Person Re-Identification with Backward-Compatibility

http://arxiv.org/abs/2403.10022v1

Compressor summary: The paper proposes a method to maintain compatibility with old models and reduce computational complexity in lifelong person re-identification by using cross-model compatibility loss and knowledge consolidation.


Lost in Overlap: Exploring Watermark Collision in LLMs

http://arxiv.org/abs/2403.10020v1

Compressor summary: Watermark collision is a problem for detecting text copyright in large language models, as two watermarks can interfere with each other's detection.


Linear optimal transport subspaces for point set classification

http://arxiv.org/abs/2403.10015v1

Compressor summary: The paper proposes a method for classifying point sets with spatial deformations using Linear Optimal Transport, which simplifies the problem and achieves competitive results.


Real-World Computational Aberration Correction via Quantized Domain-Mixing Representation

http://arxiv.org/abs/2403.10012v1

Compressor summary: The paper introduces a novel Domain Adaptive CAC (DACAC) approach that uses unpaired real-world data to improve the performance of Computational Aberration Correction in real-world applications, by proposing a Quantized Domain-Mixing Representation (QDMR) framework.


ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images

http://arxiv.org/abs/2403.10004v1

Compressor summary: Text-grounded Object Generation (TOG) is a new image editing scenario where text descriptions guide the creation of objects in real images, and ST-LDM is a framework that uses Swin-Transformer to improve spatial perception and attention in latent diffusion models.


Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

http://arxiv.org/abs/2403.10001v1

Compressor summary: The paper proposes VFMSeg, a novel pipeline that uses visual foundation models to improve cross-modal unsupervised domain adaptation for 3D point cloud segmentation by generating more accurate labels and enhancing neural networks with semantically augmented data.


FBPT: A Fully Binary Point Transformer

http://arxiv.org/abs/2403.09998v1

Compressor summary: The paper introduces a binary point cloud Transformer model that compresses neural network weights and activations for point cloud processing, and proposes a binarization mechanism called dynamic-static hybridization to address performance degradation.


Identifying Health Risks from Family History: A Survey of Natural Language Processing Techniques

http://arxiv.org/abs/2403.09997v1

Compressor summary: The survey discusses how natural language processing and machine learning can help identify hereditary health risks from family history in electronic health records for precision health applications.


MEDPNet: Achieving High-Precision Adaptive Registration for Complex Die Castings

http://arxiv.org/abs/2403.09996v1

Compressor summary: The paper presents MEDPNet, a high-precision adaptive point cloud registration method for complex die castings, and introduces DieCastCloud, a dataset tailored for this task.


TRG-Net: An Interpretable and Controllable Rain Generator

http://arxiv.org/abs/2403.09993v1

Compressor summary: The study presents a novel deep-learning-based rain generator that models the physical generation mechanism of rain, simulates expected rain, adapts to diverse rainy images, and improves deraining and downstream tasks with more controllable and diverse samples.


Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

http://arxiv.org/abs/2403.09981v1

Compressor summary: The paper introduces MVControl, a neural network architecture for controllable text-to-3D generation, using input condition images and camera poses to guide optimization-based 3D creation with efficient multi-stage 3D Gaussians representation.


EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

http://arxiv.org/abs/2403.09977v1

Compressor summary: EfficientVMamba is a novel light-weight model that combines state space models and efficient skip sampling to achieve competitive performance in various vision tasks with reduced computational complexity.


AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors

http://arxiv.org/abs/2403.09976v1

Compressor summary: The paper proposes a method to distinguish between task-irrelevant visual distractors using Implicit Action Generator (IAG) and implicit action-informed world models, which improves performance on various visual control tasks with both heterogeneous and homogeneous distractors.


Skeleton-Based Human Action Recognition with Noisy Labels

http://arxiv.org/abs/2403.09975v1

Compressor summary: The authors propose a new method, NoiseEraSAR, to improve skeleton-based action recognition by reducing label noise and setting new state-of-the-art standards.


GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery

http://arxiv.org/abs/2403.09974v1

Compressor summary: The paper proposes TES, a method that uses CLIP to generate pseudo text embeddings for unlabelled samples and a dual-branch framework to enhance visual and semantic information in the GCD task, achieving state-of-the-art results.


Den-SOFT: Dense Space-Oriented Light Field DataseT for 6-DOF Immersive Experience

http://arxiv.org/abs/2403.09973v1

Compressor summary: Key points:
- custom mobile multi-camera system for large-space dense light field capture
- aim to contribute to 3D scene reconstruction algorithms and immersive VR/AR experiences
- used 40 GoPro 10 cameras, captured images of 5k resolution, at least 1000 photos per scene
- included elements such as sky, reflections, lights and shadows
- validated dataset on three popular algorithms and integrated into Unity engine for VR realism
Summary: The authors present a custom mobile system that captures high-quality and dense light field images of large outdoor scenes, which can enhance 3D scene reconstruction and immersive VR/AR experiences.


Think Twice Before Assure: Confidence Estimation for Large Language Models through Reflection on Multiple Answers

http://arxiv.org/abs/2403.09972v1

Compressor summary: The paper proposes a two-step framework to better estimate the confidence of multiple answers generated by large language models, which can improve their trustworthiness.


Prediction of Vessel Arrival Time to Pilotage Area Using Multi-Data Fusion and Deep Learning

http://arxiv.org/abs/2403.09969v1

Compressor summary: The paper proposes a TCN model that fuses AIS, pilotage booking, and meteorological data to predict vessel arrival time with high accuracy and low error.


Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction

http://arxiv.org/abs/2403.09963v1

Compressor summary: This paper investigates prompt bias in pre-trained language models, shows its negative impact on benchmark accuracy, and proposes a representation-based approach to mitigate it during inference time.


ViTCN: Vision Transformer Contrastive Network For Reasoning

http://arxiv.org/abs/2403.09962v1

Compressor summary: The paper introduces a new model that uses vision transformers to improve machine abstract reasoning abilities on the Raven dataset, which mimics human reasoning tests.


Online GNN Evaluation Under Test-time Graph Distribution Shifts

http://arxiv.org/abs/2403.09953v1

Compressor summary: This paper introduces LeBeD, a metric to evaluate how well trained graph neural networks (GNNs) can generalize to real-world graphs with distribution shifts, by measuring learning behavior discrepancies in node prediction and structure reconstruction.


RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

http://arxiv.org/abs/2403.09948v1

Compressor summary: RadCLIP is a new AI model that uses language-image pre-training to improve medical image analysis by understanding radiological data better than existing models.


Shifting Focus: From Global Semantics to Local Prominent Features in Swin-Transformer for Knee Osteoarthritis Severity Assessment

http://arxiv.org/abs/2403.09947v1

Compressor summary: The authors propose a deep learning model that enhances local features for knee osteoarthritis grade classification using the Swin Transformer, improving accuracy and robustness in medical imaging diagnostics.


Quantization Effects on Neural Networks Perception: How would quantization change the perceptual field of vision models?

http://arxiv.org/abs/2403.09939v1

Compressor summary: This study examines how quantizing neural networks affects their perceptual fields, particularly class activation maps (CAMs), across six different CNN architectures, shedding light on the alignment between CAMs and human visual saliency maps and revealing the sensitivities of different models to quantization.


Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics

http://arxiv.org/abs/2403.09930v1

Compressor summary: QDAC is an advanced deep reinforcement learning algorithm that learns diverse and high-performing skills for adapting to complex situations.