arxiv compressed, 2024-04-02

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-04-02 generated by the compressor, my personal LLM-based project.


Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

http://arxiv.org/abs/2403.20331v1

Compressor summary: The paper proposes Unsolvable Problem Detection (UPD) to test Vision Language Models' ability to handle unanswerable questions in VQA tasks and explores various solutions to improve their performance.


Are We on the Right Way for Evaluating Large Vision-Language Models?

http://arxiv.org/abs/2403.20330v1

Compressor summary: MMStar is a new benchmark for evaluating large vision-language models that requires visual content and avoids unintentional data leakage, addressing issues in current multi-modal evaluations.


ReALM: Reference Resolution As Language Modeling

http://arxiv.org/abs/2403.20329v1

Compressor summary: The paper shows how large language models can be used to improve reference resolution for different types of entities, including on-screen ones, and achieve performance comparable to or better than GPT-4.


Gecko: Versatile Text Embeddings Distilled from Large Language Models

http://arxiv.org/abs/2403.20327v1

Compressor summary: Gecko is a compact text embedding model that uses distilled knowledge from large language models to achieve strong retrieval performance.


Localising the Seizure Onset Zone from Single-Pulse Electrical Stimulation Responses with a Transformer

http://arxiv.org/abs/2403.20324v1

Compressor summary: The paper presents Transformer models with cross-channel attention for localizing the epileptogenic focus using electrical stimulation responses, achieving better results than previous methods and handling different electrode placements and patient variability.


Towards a Framework for Evaluating Explanations in Automated Fact Verification

http://arxiv.org/abs/2403.20322v1

Compressor summary: The paper proposes a framework for systematically evaluating rationalizing explanations in NLP, with examples from automated fact verification.


MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning

http://arxiv.org/abs/2403.20320v1

Compressor summary: MTLoRA is a novel framework for efficient fine-tuning of multi-task learning models that achieves better accuracy and efficiency than existing methods while reducing the number of trainable parameters by 3.6x.
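Low-rank adaptation, the family of techniques MTLoRA builds on, can be sketched generically: instead of updating a full weight matrix W, one trains two small factors A and B so the effective weight is W + BA. A minimal NumPy illustration of this idea (my own sketch with illustrative shapes and scaling, not the paper's implementation):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass with a frozen weight W plus a low-rank update B @ A.

    W: (d_out, d_in) frozen pretrained weight
    A: (r, d_in), B: (d_out, r) trainable factors, with r << min(d_out, d_in)
    """
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))
A = np.zeros((r, d_in))               # A starts at zero ...
B = rng.standard_normal((d_out, r))   # ... so the update B @ A is zero at init
x = rng.standard_normal((3, d_in))

# With A = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Trainable parameters: r * (d_in + d_out) = 24 instead of d_in * d_out = 32;
# the gap widens dramatically at transformer scale.
print(r * (d_in + d_out), "vs", d_in * d_out)
```

The parameter-count comparison at the end is the source of the "3.6x fewer trainable parameters" style of claim in the summary: only A and B are updated while W stays frozen.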


SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects

http://arxiv.org/abs/2403.20318v1

Compressor summary: The paper investigates the problem of monocular 3D detectors generalizing to large objects and proposes SeaBird, a method that uses segmentation in bird's view with dice loss for better noise-robustness.
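Dice loss, which the summary mentions SeaBird applies to bird's-eye-view segmentation, is a standard overlap-based objective. A generic soft-dice sketch (my own illustration, not the paper's code):

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|).

    pred:   predicted probabilities in [0, 1]
    target: binary ground-truth mask
    The eps term keeps the ratio defined for empty masks.
    """
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

mask = np.array([[1.0, 1.0], [0.0, 0.0]])
# A perfect prediction gives near-zero loss; the complement gives loss near 1.
assert soft_dice_loss(mask, mask) < 1e-5
assert soft_dice_loss(1.0 - mask, mask) > 0.99
```

Because the loss depends on relative overlap rather than per-pixel error counts, it is less sensitive to label noise and class imbalance, which is the noise-robustness property the summary alludes to.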


Convolutional Prompting meets Language Models for Continual Learning

http://arxiv.org/abs/2403.20317v1

Compressor summary: ConvPrompt is a novel convolutional prompt creation mechanism for continual learning that uses layer-specific embeddings, generates text descriptions for each category, and adapts the number of prompts based on task similarity, improving performance without increasing parameter overhead.


Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

http://arxiv.org/abs/2403.20312v1

Compressor summary: This paper introduces CoN-CLIP, a framework that improves vision-language models' understanding of negations by using CC-Neg, a new dataset, and modifying CLIP's contrastive loss, leading to better zero-shot image classification and compositionality performance.


InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds

http://arxiv.org/abs/2403.20309v1

Compressor summary: InstantSplat is a framework that combines point-based representations with dense stereo models to quickly estimate camera intrinsics and extrinsics from sparse-view images, improving novel view synthesis performance.


ChainNet: Structured Metaphor and Metonymy in WordNet

http://arxiv.org/abs/2403.20308v1

Compressor summary: ChainNet is a lexical resource that captures how senses of words are related by metaphor or metonymy in the Open English Wordnet.


Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

http://arxiv.org/abs/2403.20306v1

Compressor summary: The paper discusses how to optimize energy usage and performance of large language models in data centers by exploring trade-offs between various parameters.


Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain

http://arxiv.org/abs/2403.20288v1

Compressor summary: Large Language Models can assist and correct physicians in medical decision-making tasks when the interaction is designed well, with their usefulness depending on prompt design and model accuracy.


Benchmarking Counterfactual Image Generation

http://arxiv.org/abs/2403.20287v1

Compressor summary: The paper introduces a framework for evaluating counterfactual image generation methods using various metrics and provides a Python package to benchmark different approaches.


LayerNorm: A key component in parameter-efficient fine-tuning

http://arxiv.org/abs/2403.20284v1

Compressor summary: The paper analyzes BERT components, finds output LayerNorm is crucial for fine-tuning, and shows that fine-tuning a small part of it can achieve comparable or better results than full fine-tuning.
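For context, LayerNorm itself is a small component: it standardizes each token's features and then applies a learned scale (gamma) and shift (beta), and those two vectors are a tiny fraction of a transformer's parameters. A minimal NumPy sketch of the operation (generic LayerNorm, not the paper's code):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis: normalize, then scale (gamma) and shift (beta).

    In parameter-efficient fine-tuning, gamma and beta are sometimes the only
    weights updated, which is the setting the paper studies for BERT.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With unit gamma and zero beta the output is standardized per row.
assert abs(y.mean()) < 1e-6 and abs(y.std() - 1.0) < 1e-3
```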


Sparse multimodal fusion with modal channel attention

http://arxiv.org/abs/2403.20280v1

Compressor summary: The paper explores how masked multimodal transformers can learn robust embeddings when modalities are sparse and proposes a new attention mechanism (MCA) that improves embedding quality and task performance.


LUQ: Long-text Uncertainty Quantification for LLMs

http://arxiv.org/abs/2403.20279v1

Compressor summary: LUQ is a new UQ method for long-text generation that helps identify and reduce nonfactual outputs in large language models.


Snap-it, Tap-it, Splat-it: Tactile-Informed 3D Gaussian Splatting for Reconstructing Challenging Surfaces

http://arxiv.org/abs/2403.20275v1

Compressor summary: Tactile-Informed 3DGS is a new method that uses touch data and vision to create more accurate and smoother 3D object models, especially for non-Lambertian surfaces like shiny or reflective ones.


CATSNet: a context-aware network for Height Estimation in a Forested Area based on Pol-TomoSAR data

http://arxiv.org/abs/2403.20273v1

Compressor summary: CATSNet is a context-aware deep learning method that uses convolutional neural networks to estimate forest and ground heights from TomoSAR data, outperforming existing techniques by leveraging patch-based information and context within MB TomoSAR data.


Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

http://arxiv.org/abs/2403.20271v1

Compressor summary: The paper introduces a new model and dataset for visual prompting with artificial intelligence, improving the ability of multimodal language models to understand images and follow instructions.


Latxa: An Open Language Model and Evaluation Suite for Basque

http://arxiv.org/abs/2403.20266v1

Compressor summary: Latxa is a large language model family for Basque with new pretraining and evaluation datasets, improving Basque LLM performance.


ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models

http://arxiv.org/abs/2403.20262v1

Compressor summary: ELITR-Bench is a new benchmark for long-context LLMs focused on a meeting assistant scenario, revealing gaps between open-source and proprietary models and limitations of GPT-4's evaluation method.


Prototype-based Interpretable Breast Cancer Prediction Models: Analysis and Challenges

http://arxiv.org/abs/2403.20260v1

Compressor summary: The paper proposes a framework to evaluate the quality of interpretable prototype-based models for breast cancer prediction using mammography, finding that while they perform well compared to black-box models, prototype quality still needs improvement.


Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

http://arxiv.org/abs/2403.20254v1

Compressor summary: This paper evaluates the robustness of temporal action detection methods to frame corruptions, builds two benchmarks, and proposes a simple but effective method to improve robustness using FrameDrop augmentation and Temporal-Robust Consistency loss.


MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

http://arxiv.org/abs/2403.20253v1

Compressor summary: The paper presents MedCLIP-SAM, a novel framework that uses CLIP and SAM models to generate segmentation of medical images using text prompts in zero-shot and weakly supervised settings, improving data efficiency and generalizability.


Using LLMs to Model the Beliefs and Preferences of Targeted Populations

http://arxiv.org/abs/2403.20252v1

Compressor summary: The paper explores using large language models to model the beliefs and preferences of targeted human populations, evaluating different fine-tuning approaches and a new loss term for this purpose.


Relation Rectification in Diffusion Model

http://arxiv.org/abs/2403.20249v1

Compressor summary: The paper proposes Relation Rectification, a method that uses Heterogeneous Graph Convolutional Networks to adjust text embeddings and improve the visual representation of relationships between objects in text-to-image diffusion models.


Enhancing Dimension-Reduced Scatter Plots with Class and Feature Centroids

http://arxiv.org/abs/2403.20246v1

Compressor summary: The study proposes a method to increase interpretability of dimension-reduced biomedical data by overlaying class and feature centroids on scatter plots.


Long-Tailed Anomaly Detection with Learnable Class Names

http://arxiv.org/abs/2403.20236v1

Compressor summary: LTAD is a novel method that detects defects in images from multiple and long-tailed classes without relying on class names, using reconstruction and semantic modules.


Artificial Neural Networks-based Real-time Classification of ENG Signals for Implanted Nerve Interfaces

http://arxiv.org/abs/2403.20234v1

Compressor summary: The article explores the use of artificial neural networks to classify motor/sensory stimuli from electroneurographic signals in implanted nerve interfaces for neuropathy recovery.


U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation

http://arxiv.org/abs/2403.20231v1

Compressor summary: The study proposes a new method for fine-grained visual appearance personalization that uses user-provided sentences and a decoupled self-augmentation strategy to learn target attributes and improve controllability and flexibility.


Graph Neural Aggregation-diffusion with Metastability

http://arxiv.org/abs/2403.20221v1

Compressor summary: GRADE is a novel graph neural network model that uses nonlinear diffusion and aggregation to avoid over-smoothing and create node clusters.


Advancing the Arabic WordNet: Elevating Content Quality

http://arxiv.org/abs/2403.20215v1

Compressor summary: The paper introduces a revised Arabic WordNet that improves its quality and covers multiple aspects of lexico-semantic resources.


H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model

http://arxiv.org/abs/2403.20213v1

Compressor summary: The authors developed a helpful and honest remote sensing vision-language model (H2RSVLM) that improves spatial perception and answers only answerable questions using two new datasets, HqDC-1.4M and RSSA.


On Size and Hardness Generalization in Unsupervised Learning for the Travelling Salesman Problem

http://arxiv.org/abs/2403.20212v1

Compressor summary: The paper investigates how various factors affect the performance and generalization of unsupervised learning methods for solving the Travelling Salesman Problem (TSP) using a graph neural network and a heat map approach.


Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

http://arxiv.org/abs/2403.20208v1

Compressor summary: This research trains a large language model on a corpus of tables with annotations and shows its effectiveness in solving classification, regression, and imputation tasks for tabular data.


The Future of Combating Rumors? Retrieval, Discrimination, and Generation

http://arxiv.org/abs/2403.20204v1

Compressor summary: The text proposes a comprehensive debunking process using AI that detects rumors and provides explanations to refute misinformation, ensuring high credibility assessment and relevant knowledge retrieval.


Automatic Alignment of Discourse Relations of Different Discourse Annotation Frameworks

http://arxiv.org/abs/2403.20196v1

Compressor summary: The paper presents a fully automatic method to map discourse relations from different frameworks using label embeddings learned by contrastive learning.


Enhancing Lithological Mapping with Spatially Constrained Bayesian Network (SCB-Net): An Approach for Field Data-Constrained Predictions with Uncertainty Evaluation

http://arxiv.org/abs/2403.20195v1

Compressor summary: The Spatially Constrained Bayesian Network (SCB-Net) is a new architecture that effectively uses auxiliary data and learns from spatial patterns to create reliable geological maps with uncertainty assessment.


Motion Inversion for Video Customization

http://arxiv.org/abs/2403.20193v1

Compressor summary: The paper proposes Motion Embeddings, a new way to represent and manipulate motion in videos using one-dimensional vectors that work well with video diffusion models.


Sketch-to-Architecture: Generative AI-aided Architectural Design

http://arxiv.org/abs/2403.20186v1

Compressor summary: The text describes a novel workflow that uses generative AI to create floorplans and 3D models from sketches, enabling faster architectural design based on textual descriptions.


HARMamba: Efficient Wearable Sensor Human Activity Recognition Based on Bidirectional Selective SSM

http://arxiv.org/abs/2403.20183v1

Compressor summary: HARMamba is a lightweight selective state space model for real-time wearable sensor activity recognition that outperforms Transformer-based models while reducing computational and memory overhead.


Measuring Taiwanese Mandarin Language Understanding

http://arxiv.org/abs/2403.20180v1

Compressor summary: This paper introduces TMLU, a new evaluation suite for Chinese language models, especially Taiwanese Mandarin, that covers various subjects and assesses their knowledge and reasoning skills with explanations.


Artificial consciousness. Some logical and conceptual preliminaries

http://arxiv.org/abs/2403.20177v1

Compressor summary: The text discusses the theoretical and empirical possibility of artificial consciousness, suggesting that dimensions and profiles of consciousness should be used for a balanced discussion, and outlines a research strategy for realizing "awareness" in artificial systems.


HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes

http://arxiv.org/abs/2403.20159v1

Compressor summary: The paper proposes HGS-Mapping, a fast and accurate online dense mapping framework for urban scenes using Hybrid Gaussian Representation, which models different parts of the scene with Gaussians with distinct properties.


ChatGPT v.s. Media Bias: A Comparative Study of GPT-3.5 and Fine-tuned Language Models

http://arxiv.org/abs/2403.20158v1

Compressor summary: The study compares ChatGPT's ability to detect different types of media bias against fine-tuned models, showing mixed results.


A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

http://arxiv.org/abs/2403.20157v1

Compressor summary: Subword methods improve machine translation for low-resource languages, but their effectiveness depends on orthographic word boundaries and fine-tuning methods.


Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

http://arxiv.org/abs/2403.20153v1

Compressor summary: The paper introduces Talk3D, a framework that uses 3D-aware generative prior to synthesize realistic talking head animations from audio inputs, achieving better performance than existing methods.


A Learning-based Incentive Mechanism for Mobile AIGC Service in Decentralized Internet of Vehicles

http://arxiv.org/abs/2403.20151v1

Compressor summary: This paper proposes a decentralized incentive mechanism using multi-agent deep reinforcement learning for allocating AI-generated content on roadside units in the Internet of Vehicles, improving user experience and reducing latency.


TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods

http://arxiv.org/abs/2403.20150v1

Compressor summary: TFB is an automated benchmark for comparing time series forecasting methods across diverse domains, datasets, and evaluation strategies.


Conformal Prediction for Stochastic Decision-Making of PV Power in Electricity Markets

http://arxiv.org/abs/2403.20149v1

Compressor summary: The paper investigates how conformal prediction, a probabilistic forecasting method, improves photovoltaic power predictions for electricity markets using different bidding strategies and shows that it outperforms linear methods in profit and energy balance.
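The core of (split) conformal prediction is simple: calibration residuals are converted into a quantile that yields prediction intervals with guaranteed marginal coverage under exchangeability. A generic sketch of that step (my own illustration; the paper's bidding strategies are not reproduced here):

```python
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha=0.1):
    """Split conformal prediction: turn calibration residuals |y - yhat| into a
    half-width q so that intervals yhat ± q cover new points with probability
    >= 1 - alpha (assuming exchangeability of calibration and test data)."""
    n = len(cal_residuals)
    # Finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_residuals, level, method="higher")

rng = np.random.default_rng(1)
residuals = np.abs(rng.normal(0.0, 1.0, size=1000))
q = split_conformal_halfwidth(residuals, alpha=0.1)
# At least ~90% of the calibration residuals fall below the returned q.
assert (residuals <= q).mean() >= 0.9
```

For a PV-power application, `cal_residuals` would be absolute errors of the point forecaster on held-out calibration days, and the resulting interval feeds the bidding strategy.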


IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context

http://arxiv.org/abs/2403.20147v1

Compressor summary: IndiBias is a new dataset for evaluating social biases in Indian context and languages, covering various dimensions and intersectionality, based on existing resources and LLMs' inputs.


Fine-tuning Large Language Models for Automated Diagnostic Screening Summaries

http://arxiv.org/abs/2403.20145v1

Compressor summary: The authors evaluate state-of-the-art language models for generating summaries from mental health examinations, finding that their fine-tuned model performs well and could potentially improve support in developing countries.


StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation

http://arxiv.org/abs/2403.20142v1

Compressor summary: StegoGAN is a novel GAN-based model that uses steganography to prevent spurious features in non-bijective image translation tasks, enhancing semantic consistency without extra supervision.


Accurate Block Quantization in LLMs with Outliers

http://arxiv.org/abs/2403.20137v1

Compressor summary: The paper proposes a novel method to improve quantization accuracy using low precision BFP formats by rearranging outliers in weights and activations, reducing memory footprint without compromising model accuracy.
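The outlier problem the paper targets is easy to demonstrate: in block formats, every value in a block shares one scale (or exponent), so a single outlier inflates that scale and crushes the resolution of its neighbors. A toy per-block symmetric quantizer showing the effect (my own sketch, not the paper's BFP scheme or its reordering method):

```python
import numpy as np

def block_quantize(x, block=8, bits=8):
    """Per-block symmetric quantization: each block shares one scale derived
    from its max magnitude, so one outlier degrades the whole block."""
    qmax = 2 ** (bits - 1) - 1
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    q = np.round(x / scale).clip(-qmax, qmax)
    return (q * scale).reshape(-1)  # dequantized values

calm = np.full(8, 0.1)
spiky = calm.copy()
spiky[0] = 100.0  # one outlier in the block
err_calm = np.abs(block_quantize(calm) - calm).max()
err_spiky = np.abs(block_quantize(spiky) - spiky)[1:].max()
# The outlier inflates the shared scale; the small values lose all precision.
assert err_spiky > err_calm
```

Rearranging tensors so that outliers are grouped into their own blocks, as the summary describes, keeps the shared scales of the remaining blocks small.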


User Modeling Challenges in Interactive AI Assistant Systems

http://arxiv.org/abs/2403.20134v1

Compressor summary: The text discusses how AI assistants can better understand users' mental states to provide more personalized guidance during tasks using large language models.


The Impact of Prompts on Zero-Shot Detection of AI-Generated Text

http://arxiv.org/abs/2403.20127v1

Compressor summary: This paper examines how prompts affect the accuracy of zero-shot detectors in identifying AI-generated texts and proposes a framework to evaluate them.


ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning

http://arxiv.org/abs/2403.20126v1

Compressor summary: The paper presents ECLIPSE, a novel method for continual panoptic segmentation that fine-tunes prompt embeddings and uses logit manipulation to address catastrophic forgetting and plasticity, achieving state-of-the-art results.


Application of Machine Learning Algorithms in Classifying Postoperative Success in Metabolic Bariatric Surgery: A Comprehensive Study

http://arxiv.org/abs/2403.20124v1

Compressor summary: The study proposes a novel machine learning approach to classify patients undergoing metabolic bariatric surgery, showing that enhanced KNN and Decision Tree models with oversampling techniques can achieve high accuracy in predicting patient outcomes.


Learning using granularity statistical invariants for classification

http://arxiv.org/abs/2403.20122v1

Compressor summary: LUGSI is a new learning paradigm that improves classification performance and speed on large-scale datasets by using granularity statistical invariants to enhance structural information and reduce computational cost.


Segmentation, Classification and Interpretation of Breast Cancer Medical Images using Human-in-the-Loop Machine Learning

http://arxiv.org/abs/2403.20112v1

Compressor summary: The paper proposes a doctor-in-the-loop method to improve machine learning models for breast cancer analysis using genomic data and image analysis, but finds that it is not always effective due to the complexity of the domain.


Mol-AIR: Molecular Reinforcement Learning with Adaptive Intrinsic Rewards for Goal-directed Molecular Generation

http://arxiv.org/abs/2403.20109v1

Compressor summary: Mol-AIR is a reinforcement learning framework for generating molecules with desired properties using adaptive intrinsic rewards and random distillation network, improving drug discovery efficiency.


Aggregating Local and Global Features via Selective State Spaces Model for Efficient Image Deblurring

http://arxiv.org/abs/2403.20106v1

Compressor summary: The paper proposes an efficient image deblurring network using selective structured state spaces and aggregate local and global blocks to balance between accuracy and efficiency.


FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

http://arxiv.org/abs/2403.20105v1

Compressor summary: The text describes a zero-shot, training-free method for image segmentation using foundation models like CLIP and diffusion models, which achieves competitive results compared to weakly-supervised approaches.


NLP for Counterspeech against Hate: A Survey and How-To Guide

http://arxiv.org/abs/2403.20103v1

Compressor summary: The paper provides a guide for NLP researchers to conduct counterspeech research against online hate by describing steps, best practices, and open challenges.


RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

http://arxiv.org/abs/2403.20101v1

Compressor summary: RealKIE is a new benchmark with five diverse datasets for key information extraction research, focusing on enterprise applications and addressing real-world challenges like text serialization, sparse annotations, and complex tables.


ITCMA: A Generative Agent Based on a Computational Consciousness Structure

http://arxiv.org/abs/2403.20097v1

Compressor summary: The paper introduces a computational consciousness structure called ITCM and an agent (ITCMA) that enhances LLMs' understanding of implicit instructions and common-sense knowledge for better performance in open-world settings, including real-world tasks with robots.


Modeling Weather Uncertainty for Multi-weather Co-Presence Estimation

http://arxiv.org/abs/2403.20092v1

Compressor summary: The paper presents a novel approach to handling multiple weather conditions in outdoor scenes that models weather uncertainty with a Gaussian mixture model and prior-posterior learning, introduces the MePe dataset for benchmarking weather classification, and achieves state-of-the-art performance and generalization.


Implications of the AI Act for Non-Discrimination Law and Algorithmic Fairness

http://arxiv.org/abs/2403.20089v1

Compressor summary: The text discusses fairness in AI from a European law perspective, highlighting the need for bridging algorithmic fairness and non-discrimination law through the AI Act, which may affect bias detection and correction strategies.


An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models

http://arxiv.org/abs/2403.20088v1

Compressor summary: The authors study how different languages influence the performance of pre-trained multilingual models on various target languages, using adapter units to disentangle language effects and provide a list of recommended transfer configurations.


IPA Transcription of Bengali Texts

http://arxiv.org/abs/2403.20084v1

Compressor summary: The text discusses the need for a standardized IPA system for Bengali pronunciation and presents a new framework that includes a novel dataset and deep learning-based benchmarks.


Mixed-precision Supernet Training from Vision Foundation Models using Low Rank Adapter

http://arxiv.org/abs/2403.20080v1

Compressor summary: The paper presents a method to optimize and compress large vision models using mixed-precision search and memory-efficient training, achieving significant BitOPs reduction without sacrificing performance.


SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior

http://arxiv.org/abs/2403.20079v1

Compressor summary: The text proposes a new approach that improves neural rendering for street scenes by combining a diffusion model and multi-modal data to handle deviations from training viewpoints.


Negative Label Guided OOD Detection with Pretrained Vision-Language Models

http://arxiv.org/abs/2403.20078v1

Compressor summary: NegLabel is a novel OOD detection method for vision-language models that uses negative labels from corpus databases and achieves state-of-the-art performance on various benchmarks and domains.


Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets

http://arxiv.org/abs/2403.20056v1

Compressor summary: The study examines how well multilingual language models perform in recognizing named entities across different languages, finding that transfer ability depends on shared entity chunks and robustness to input perturbations.


Embracing Unknown Step by Step: Towards Reliable Sparse Training in Real World

http://arxiv.org/abs/2403.20047v1

Compressor summary: Sparse training can make deep neural networks unreliable in detecting out-of-distribution data, but a new method improves their performance and reliability without increasing costs or requiring extra data.


Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning

http://arxiv.org/abs/2403.20046v1

Compressor summary: The study explores how large language models can learn from their mistakes in reasoning tasks using new benchmark CoTErrorSet and two methods: self-rethinking prompting and mistake tuning.


Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

http://arxiv.org/abs/2403.20041v1

Compressor summary: The authors propose four optimization techniques to improve LLM deployment on mobile devices and achieve significant speedups in inference tasks compared to existing methods.


NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising

http://arxiv.org/abs/2403.20034v1

Compressor summary: NeSLAM is a framework that improves 3D reconstruction and camera tracking in RGB-D SLAM systems using NeRF, dense depth estimation, SDF scene representation, and self-supervised feature tracking.


HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes

http://arxiv.org/abs/2403.20032v1

Compressor summary: HO-Gaussian is a hybrid method that improves neural rendering of urban scenes by combining 3D Gaussian Splatting with grid-based volume and view-dependent color representation, overcoming previous limitations and enabling real-time photo-realistic results on multi-camera datasets.


A Unified Framework for Human-centric Point Cloud Video Understanding

http://arxiv.org/abs/2403.20031v1

Compressor summary: The paper presents a unified framework for human-centric point cloud video understanding that leverages prior knowledge and inherent data features instead of huge labeled datasets, achieving state-of-the-art results on action recognition and 3D pose estimation.


FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues

http://arxiv.org/abs/2403.20026v1

Compressor summary: The Feature Swapping Multi-modal Reasoning (FSMR) model enhances textual and visual understanding by exchanging features between images and words, using a pre-trained visual-language encoder and a multi-modal cross-attention mechanism.


Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity

http://arxiv.org/abs/2403.20022v1

Compressor summary: Psychometry is an omnifit model that uses fMRI data from different subjects to reconstruct images by capturing inter-subject commonalities and individual differences, enhancing the representation with subject-specific memories.


Adverb Is the Key: Simple Text Data Augmentation with Adverb Deletion

http://arxiv.org/abs/2403.20015v1

Compressor summary: The paper presents a text data augmentation method that deletes adverbs to preserve semantics while being efficient and effective for various tasks.
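The mechanism is simple enough to sketch. A real implementation would identify adverbs with a POS tagger; in this toy version (my own illustration, not the paper's code) a crude "-ly" heuristic plus an explicit word list stands in for tagging:

```python
import re

def delete_adverbs(sentence, adverbs=None):
    """Toy adverb-deletion augmentation: drop tokens that look like adverbs.

    Hypothetical heuristic only: words ending in '-ly' (length > 4) or listed
    in `adverbs` are removed; a POS tagger would replace this in practice.
    """
    adverbs = set(a.lower() for a in (adverbs or []))
    kept = []
    for tok in sentence.split():
        word = re.sub(r"\W+$", "", tok).lower()  # strip trailing punctuation
        if (word.endswith("ly") and len(word) > 4) or word in adverbs:
            continue
        kept.append(tok)
    return " ".join(kept)

print(delete_adverbs("The model quickly converges and performs very well.",
                     adverbs=["very"]))
# → "The model converges and performs well."
```

The appeal, per the summary, is that the augmented sentence stays grammatical and near-synonymous with the original at essentially zero cost.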


DerainNeRF: 3D Scene Estimation with Adhesive Waterdrop Removal

http://arxiv.org/abs/2403.20013v1

Compressor summary: The paper presents a method that combines an attention network with a Neural Radiance Field model to remove adhesive waterdrops from weather-degraded multi-view images, generating clear 3D scenes and high-quality novel views and surpassing existing SOTA removal methods.


Colorful Cutout: Enhancing Image Data Augmentation with Curriculum Learning

http://arxiv.org/abs/2403.20012v1

Compressor summary: The study introduces colorful cutout, a curriculum data augmentation technique for images that gradually increases noise and difficulty, improving generalization and performance.
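Cutout, the base augmentation here, overwrites a random patch of the image; a curriculum variant like the one summarized would grow the patch and vary its colors as training progresses. A sketch of the basic mechanism only (my own illustration with an arbitrary fill color, not the paper's schedule):

```python
import numpy as np

def cutout(img, size, fill, rng):
    """Cutout augmentation: overwrite a random size x size patch with `fill`.

    A curriculum version would increase `size` and randomize `fill` per
    sub-patch over training epochs; this shows a single fixed step.
    """
    h, w, _ = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    out = img.copy()
    out[y:y + size, x:x + size] = fill
    return out

rng = np.random.default_rng(0)
img = np.zeros((8, 8, 3))
aug = cutout(img, size=4, fill=[1.0, 0.5, 0.0], rng=rng)
# Exactly one 4x4 patch (16 pixels) was recolored.
assert (aug != img).any(axis=-1).sum() == 16
```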


On Large Language Models' Hallucination with Regard to Known Facts

http://arxiv.org/abs/2403.20009v1

Compressor summary: The text investigates how large language models can know correct answers but still hallucinate, using inference dynamics analysis and a classifier based on output token probabilities.


Large Language Model based Situational Dialogues for Second Language Learning

http://arxiv.org/abs/2403.20005v1

Compressor summary: The paper proposes situational dialogue models for second language learners to practice conversational skills using large language models, which can handle various topics and are evaluated automatically.


Grounding and Enhancing Grid-based Models for Neural Fields

http://arxiv.org/abs/2403.20002v1

Compressor summary: The paper introduces a theoretical framework for analyzing grid-based neural field models and develops a novel model, MulFAGrid, which outperforms existing models in various tasks.


DeepHeteroIoT: Deep Local and Global Learning over Heterogeneous IoT Sensor Data

http://arxiv.org/abs/2403.19996v1

Compressor summary: The paper proposes a deep learning model combining a CNN and a BGRU to learn both local and global features from heterogeneous IoT sensor data (differing timestamps, frequencies, locations, and units), achieving better classification results than state-of-the-art methods and baselines.


Development of Compositionality and Generalization through Interactive Learning of Language and Action of Robots

http://arxiv.org/abs/2403.19995v1

Compressor summary:
Key points:
- The text discusses how humans can apply learned behavior to unlearned situations using compositionality, a skill that combines language and action.
- The authors propose a neural network model that integrates vision, proprioception, and language to learn this skill in robots.
- The results show that increasing task variations improves generalization and that visual attention and working memory are crucial for achieving linguistic goals.
Summary: The text presents a brain-inspired model that teaches robots compositionality, the ability to reuse language and action parts in new situations, and shows how it benefits from increased task variation and visual attention.


MindArm: Mechanized Intelligent Non-Invasive Neuro-Driven Prosthetic Arm System

http://arxiv.org/abs/2403.19992v1

Compressor summary: MindArm is a low-cost, non-invasive neuro-driven prosthetic arm system that uses EEG electrodes and a deep neural network to translate brain signals into prosthetic arm motions.


Stable Surface Regularization for Fast Few-Shot NeRF

http://arxiv.org/abs/2403.19985v1

Compressor summary: The paper introduces an algorithm that uses a new surface regularization technique called ASDF to generate high-quality novel views from few input images, making it faster and more stable than existing methods.


A Parallel Attention Network for Cattle Face Recognition

http://arxiv.org/abs/2403.19980v1

Compressor summary:
Key points:
- Cattle face recognition is important for animal husbandry and behavioral research
- New large-scale dataset ICRWE created for wild environments
- Novel parallel attention network (PANet) introduced for cattle face recognition
- PANet achieves 88.03% accuracy on ICRWE, the state-of-the-art approach
Summary: The paper presents a new large-scale dataset and a novel network for recognizing cattle faces in wild environments, achieving the highest accuracy so far.


Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer

http://arxiv.org/abs/2403.19979v1

Compressor summary: The paper proposes a method for continuous learning that improves on previous approaches by using adapter tuning and feature sampling without expanding the model or retaining old samples.


eTraM: Event-based Traffic Monitoring Dataset

http://arxiv.org/abs/2403.19976v1

Compressor summary: Event cameras can be used for static traffic monitoring with high performance, as shown by the novel eTraM dataset and its evaluation on various scenarios and models.


Context-Aware Integration of Language and Visual References for Natural Language Tracking

http://arxiv.org/abs/2403.19975v1

Compressor summary: Tracking by natural language (TNL) aims to track a target in a video from a language description, but existing methods suffer from drift and ambiguity; the proposed joint multi-modal framework improves accuracy and consistency by leveraging visual and linguistic cues and decoding them together.


Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

http://arxiv.org/abs/2403.19969v1

Compressor summary: The SMART pruner improves the accuracy of DNN block/channel pruning with a learnable probability mask, a differentiable Top-k operator, and a dynamic temperature parameter, making the pruning process dynamic, differentiable, and adaptable.
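To see how a temperature parameter can make Top-k selection differentiable, here is a toy numpy sketch (an illustrative stand-in, not the paper's implementation): a sigmoid centered between the k-th and (k+1)-th largest importance scores yields a soft mask, and annealing the temperature toward zero hardens it into a 0/1 selection.

```python
import numpy as np

def soft_topk_mask(importance: np.ndarray, k: int, temperature: float) -> np.ndarray:
    """Toy differentiable Top-k mask: smooth at high temperature (gradients
    flow to all scores), effectively binary as temperature -> 0."""
    s = np.sort(importance)
    threshold = 0.5 * (s[-k] + s[-k - 1])   # midpoint between kept and pruned
    return 1.0 / (1.0 + np.exp(-(importance - threshold) / temperature))
```

In a pruner, such a mask would multiply block or channel outputs, so the importance scores can be trained by backpropagation before the mask is finally binarized.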


Rewrite the Stars

http://arxiv.org/abs/2403.19967v1

Compressor summary: The paper explores how element-wise multiplication (star operation) can create non-linear features in neural networks without increasing size and introduces StarNet, a prototype that shows promising results.
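The star operation itself takes one line to sketch. A minimal numpy illustration (weight names are arbitrary): multiplying two linear branches of the same input produces pairwise feature interactions, so the result is non-linear (quadratic) in the input without widening the layer.

```python
import numpy as np

def star_op(x, w1, b1, w2, b2):
    """Star operation sketch: element-wise product of two linear branches.
    Expanding (x@w1)*(x@w2) shows implicit second-order feature terms."""
    return (x @ w1 + b1) * (x @ w2 + b2)
```

With zero biases, doubling the input quadruples the output, which is a quick way to verify the map is quadratic rather than linear.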


FairRAG: Fair Human Generation via Fair Retrieval Augmentation

http://arxiv.org/abs/2403.19964v1

Compressor summary: FairRAG is a framework that uses external images to improve fairness and diversity in text-to-image generation models by applying debiasing strategies.


Efficient Modulation for Vision Networks

http://arxiv.org/abs/2403.19963v1

Compressor summary: The text introduces a novel design for efficient vision networks called EfficientMod, which improves accuracy and efficiency by using a modulation mechanism with an MLP block.


Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning

http://arxiv.org/abs/2403.19962v1

Compressor summary: The authors propose a method to improve LLMs' abilities as intelligent agents by using GPT-4-constructed data, supervised fine-tuning, and techniques like multi-path reasoning and task decomposition.


Coverage-Guaranteed Prediction Sets for Out-of-Distribution Data

http://arxiv.org/abs/2403.19950v1

Compressor summary: The paper proposes a method for making confident predictions in out-of-distribution settings by modifying split conformal prediction and proves its validity.
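For context, the vanilla split conformal baseline that the paper modifies can be sketched in a few lines. This is the standard textbook procedure under exchangeability (the in-distribution assumption the paper's method relaxes), not the paper's OOD-adapted version.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Split conformal prediction: calibrate absolute residuals on a
    held-out set, then return an interval with ~(1 - alpha) marginal
    coverage for exchangeable data."""
    scores = np.sort(np.abs(cal_labels - cal_preds))   # nonconformity scores
    n = len(scores)
    idx = int(np.ceil((n + 1) * (1 - alpha))) - 1      # conformal quantile rank
    q = scores[min(idx, n - 1)]
    return test_pred - q, test_pred + q
```

The coverage guarantee breaks when test data are out-of-distribution, which is exactly the gap the paper's modified procedure targets.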


FairCLIP: Harnessing Fairness in Vision-Language Learning

http://arxiv.org/abs/2403.19949v1

Compressor summary:
Key points:
- FairVLMed is a new medical vision-language dataset for studying fairness in VL models
- CLIP and BLIP2, two widely-used VL models, show significant biases across four protected attributes
- FairCLIP is a proposed method to reduce these biases using optimal transport
Summary: The paper introduces FairVLMed, the first dataset for fairness analysis of medical vision-language models. It shows that CLIP and BLIP2 are biased towards certain subgroups and proposes FairCLIP to mitigate these biases.


Binarized Low-light Raw Video Enhancement

http://arxiv.org/abs/2403.19944v1

Compressor summary: The paper explores using extremely compact binary neural networks for low-light raw video enhancement, addressing issues of temporal information fusion and performance gap between binary and full precision convolutions.
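The 1-bit building block behind such binary networks can be sketched as follows; this is the generic XNOR-Net-style scaled sign binarization, not the paper's temporal-fusion architecture.

```python
import numpy as np

def binarize_weights(w: np.ndarray) -> np.ndarray:
    """Scaled sign binarization sketch: weights become {-alpha, +alpha},
    with alpha the mean absolute value, so the float convolution can be
    replaced by cheap bitwise operations plus a single scale."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)
```

The "performance gap" the summary mentions is the accuracy lost by this 1-bit approximation relative to full-precision convolutions, which the paper's design works to close.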


TDANet: A Novel Temporal Denoise Convolutional Neural Network With Attention for Fault Diagnosis

http://arxiv.org/abs/2403.19943v1

Compressor summary: TDANet is a novel Deep Learning technique for fault diagnosis in noisy industrial environments, using multi-scale 2D convolutions and attention modules to enhance signal features and achieve high diagnostic accuracy.


Diverse Feature Learning by Self-distillation and Reset

http://arxiv.org/abs/2403.19941v1

Compressor summary: The paper proposes Diverse Feature Learning (DFL), which combines self-distillation and reset to help models preserve important features and learn new ones for image classification.


SLFNet: Generating Semantic Logic Forms from Natural Language Using Semantic Probability Graphs

http://arxiv.org/abs/2403.19936v1

Compressor summary: SLFNet is a new neural network for natural language interfaces that uses syntactic information, semantic probability graphs, and Multi-Head SLF Attention to convert user commands into structured forms.


CP HDR: A feature point detection and description library for LDR and HDR images

http://arxiv.org/abs/2403.19935v1

Compressor summary: The authors review feature point detection and description methods for high dynamic range images and propose two modified algorithms (SIFT for HDR and Harris for HDR) that improve performance in computer vision tasks.


Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

http://arxiv.org/abs/2403.19930v1

Compressor summary: The paper studies how to fine-tune large language models for a specific natural language understanding task in supervised settings, focusing on Chinese short text matching and examining different factors that affect performance.


DiJiang: Efficient Large Language Models through Compact Kernelization

http://arxiv.org/abs/2403.19928v1

Compressor summary: The paper introduces DiJiang, a linear complexity model for Transformers that reduces training costs and inference speeds using frequency domain kernelization and Discrete Cosine Transform.


Video-Based Human Pose Regression via Decoupled Space-Time Aggregation

http://arxiv.org/abs/2403.19926v1

Compressor summary: The paper presents a novel video-based human pose regression method that efficiently captures spatial and temporal dependencies using a Decoupled Space-Time Aggregation network (DSTA) without relying on intermediate heatmaps.


Decision Mamba: Reinforcement Learning via Sequence Modeling with Selective State Spaces

http://arxiv.org/abs/2403.19925v1

Compressor summary: The paper explores combining Decision Transformer and Mamba frameworks to improve sequential decision-making in reinforcement learning tasks by enhancing sequence modeling efficiency and effectiveness.


SceneTracker: Long-term Scene Flow Estimation Network

http://arxiv.org/abs/2403.19924v1

Compressor summary: The paper proposes a novel network called SceneTracker that can estimate long-term 3D motion in scenes by using appearance and depth correlation features and the Transformer architecture.


MI-NeRF: Learning a Single Face NeRF from Multiple Identities

http://arxiv.org/abs/2403.19920v1

Compressor summary: MI-NeRF is a single neural network that learns non-rigid facial motion for multiple identities from monocular videos, using a multiplicative module to capture identity and non-identity information interactions.


Diff-Reg v1: Diffusion Matching Model for Registration Problem

http://arxiv.org/abs/2403.19919v1

Compressor summary: The diffusion matching model uses a denoising process to create robust correspondences for registration tasks, addressing challenges like large deformation and scale inconsistency in complex scenarios.


MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models

http://arxiv.org/abs/2403.19913v1

Compressor summary: MANGO is a benchmark for evaluating large language models' mapping and navigation skills using text-based mazes and questions.


Automated Identification and Segmentation of Hi Sources in CRAFTS Using Deep Learning Method

http://arxiv.org/abs/2403.19912v1

Compressor summary: The method uses deep learning to identify and segment HI (neutral hydrogen) sources in 3D radio data from CRAFTS, with high accuracy and recall rates.


Beyond the Known: Novel Class Discovery for Open-world Graph Learning

http://arxiv.org/abs/2403.19907v1

Compressor summary: The paper introduces ORAL, a novel method for discovering novel classes on graphs using semi-supervised learning and multi-scale features.


Classification of Diabetic Retinopathy using Pre-Trained Deep Learning Models

http://arxiv.org/abs/2403.19905v1

Compressor summary:
Key points:
- The paper presents a CAD system for automatic classification of retinal images into five DR classes using CNNs and fine-tuning techniques.
- The model is trained on fundus images with different resolutions and tested on the Kaggle platform.
- The achieved AUC values are reported for various deep learning models.
Summary: The paper introduces a CAD system that uses CNNs and fine-tuning to classify retinal images into five DR classes, achieving different AUC values for various models.


Fully Geometric Panoramic Localization

http://arxiv.org/abs/2403.19904v1

Compressor summary: The paper presents a novel localization method using only 2D-3D line geometry, avoiding visual descriptors and achieving fast and accurate results for challenging scenes.


Heterogeneous Network Based Contrastive Learning Method for PolSAR Land Cover Classification

http://arxiv.org/abs/2403.19902v1

Compressor summary: HCLNet is a novel method that uses heterogeneous architecture and contrastive learning to improve PolSAR image classification with few-shot learning and multi-features, addressing the challenges of labeled data scarcity and scattering confusion.
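The generic objective behind such contrastive pretraining is InfoNCE. A minimal numpy sketch of that standard loss follows; HCLNet's heterogeneous two-branch architecture and multi-feature setup are not modeled here.

```python
import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    """InfoNCE sketch: pull matching views (row i of z1 and z2) together
    and push all other pairs apart, via temperature-scaled cosine
    similarity and a row-wise log-softmax."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    # positives sit on the diagonal of the similarity matrix
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))
```

Perfectly aligned view pairs drive the loss toward zero, while misaligned pairs are penalized, which is what lets such methods learn from unlabeled PolSAR data.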


Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

http://arxiv.org/abs/2403.19898v1

Compressor summary: The paper proposes a new model called StrDiffusion that guides image inpainting with structure-based semantics to reduce semantic discrepancy between masked and unmasked regions.


Disentangling Racial Phenotypes: Fine-Grained Control of Race-related Facial Phenotype Characteristics

http://arxiv.org/abs/2403.19897v1

Compressor summary: The paper presents a novel GAN framework that enables fine-grained control over race-related facial features in 2D images using a new dataset and preserving facial identity for racial bias mitigation.


Nonlinearity Enhanced Adaptive Activation Function

http://arxiv.org/abs/2403.19896v1

Compressor summary: The paper introduces an improved adaptive activation function for neural networks that can increase accuracy with little extra computation, though it involves tradeoffs and adds adjustable parameters that must be tuned.
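As a generic illustration of adaptive (parameterized) activations, here is a standard parametric Swish; this is a well-known example of the concept, not the paper's specific function.

```python
import numpy as np

def parametric_swish(x: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Swish with an adjustable slope beta, which can be learned jointly
    with the weights: beta -> 0 recovers the linear map x/2, while large
    beta approaches ReLU."""
    return x / (1.0 + np.exp(-beta * x))
```

The tradeoff the summary alludes to applies here too: the extra parameter adds expressiveness but must be initialized and optimized alongside the network weights.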


PLoc: A New Evaluation Criterion Based on Physical Location for Autonomous Driving Datasets

http://arxiv.org/abs/2403.19893v1

Compressor summary: This paper proposes PLoc, a novel evaluation criterion for object detection in autonomous driving based on physical location, and presents ApolloScape-R, a re-annotated dataset reflecting this criterion.


Towards a Robust Retrieval-Based Summarization System

http://arxiv.org/abs/2403.19889v1

Compressor summary: The paper proposes LogicSumm, a framework to evaluate large language models' robustness in summarization tasks, and SummRAG, a system to improve their performance by fine-tuning them on realistic scenarios.


MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

http://arxiv.org/abs/2403.19888v1

Compressor summary: MambaMixer combines selective token and channel mixing to improve scalability and performance in vision and time series tasks, outperforming existing models.