arxiv compressed, 2024-03-27

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-27, generated by the compressor, my personal LLM-based project.


One-Shot Domain Incremental Learning

http://arxiv.org/abs/2403.16707v1

Compressor summary: The paper proposes a new technique for domain incremental learning with deep neural networks when only one sample from a new domain is available, by addressing issues in batch normalization layers.


ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search

http://arxiv.org/abs/2403.16702v1

Compressor summary: This paper introduces ProCQA, a mixed-modal QA dataset from StackOverflow, and proposes a modality-agnostic contrastive pre-training method to improve code language models' text and code alignment for code question answering.


DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization

http://arxiv.org/abs/2403.16697v1

Compressor summary: Dynamic PromptStyler (DPStyler) is a model that improves source-free domain generalization by updating styles, removing style variations, and ensembling models.


ToXCL: A Unified Framework for Toxic Speech Detection and Explanation

http://arxiv.org/abs/2403.16685v1

Compressor summary: ToXCL is a framework that detects and explains implicit toxic speech in online posts, improving on previous models by generating targeted demographic groups and using knowledge distillation for better detection.


Symmetric Basis Convolutions for Learning Lagrangian Fluid Mechanics

http://arxiv.org/abs/2403.16680v1

Compressor summary: The paper presents a new method for continuous convolutions in fluid simulations using separable basis functions, particularly Fourier-based networks, which improve accuracy and stability over existing methods.


FOOL: Addressing the Downlink Bottleneck in Satellite Computing with Neural Feature Compression

http://arxiv.org/abs/2403.16677v1

Compressor summary: FOOL is a novel feature compression method for nanosatellites that reduces transfer costs by processing raw satellite images onboard, leveraging inter-tile dependencies and context, while maintaining high prediction performance and perceptual quality.


Who is bragging more online? A large scale analysis of bragging in social media

http://arxiv.org/abs/2403.16668v1

Compressor summary: This paper uses computational methods to study how prevalent bragging is on Twitter in the U.S., how it changes over time, and how it varies with demographics, and identifies bragging topics related to user characteristics.


Deep Reinforcement Learning and Mean-Variance Strategies for Responsible Portfolio Optimization

http://arxiv.org/abs/2403.16667v1

Compressor summary: This paper explores using deep reinforcement learning to optimize investment portfolios while considering environmental, social, and governance (ESG) goals.


RU22Fact: Optimizing Evidence for Multilingual Explainable Fact-Checking on Russia-Ukraine Conflict

http://arxiv.org/abs/2403.16662v1

Compressor summary: The text proposes a method using a Large Language Model to retrieve and summarize evidence for explainable fact-checking systems, and introduces RU22Fact, a multilingual dataset on the Russia-Ukraine conflict in 2022 with claims, evidence, and explanations.


Graph Augmentation for Recommendation

http://arxiv.org/abs/2403.16656v1

Compressor summary: GraphAug is a framework that improves recommendation systems by generating denoised self-supervised signals and adapting contrastive view generation using graph information bottleneck regularization.


Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT

http://arxiv.org/abs/2403.16655v1

Compressor summary: The project analyzes and corrects text errors using advanced neural network-based language models, BART and MarianMT.


A Novel Loss Function-based Support Vector Machine for Binary Classification

http://arxiv.org/abs/2403.16654v1

Compressor summary: The paper proposes a new SVM loss function called $\ell_s$ that improves generalization by considering the degree of penalty for correctly classified samples within the margin, and presents a fast optimization algorithm ($\ell_s$-ADMM) to handle it.


CLHA: A Simple yet Effective Contrastive Learning Framework for Human Alignment

http://arxiv.org/abs/2403.16649v1

Compressor summary: The paper proposes a simple contrastive learning framework to align large language models with human preferences using a novel rescoring strategy and pairwise contrastive loss.


Clustering Propagation for Universal Medical Image Segmentation

http://arxiv.org/abs/2403.16646v1

Compressor summary: S2VNet is a universal framework that unifies automatic and interactive medical image segmentation using slice-to-volume propagation, achieving faster inference speeds and reduced memory consumption than existing 3D solutions.


AI-Generated Video Detection via Spatio-Temporal Anomaly Learning

http://arxiv.org/abs/2403.16638v1

Compressor summary: The text proposes an effective AI-generated video detection scheme using a two-branch spatio-temporal CNN with ResNet sub-detectors, and presents a large-scale dataset for benchmarking and evaluation.


V2X-PC: Vehicle-to-everything Collaborative Perception via Point Cluster

http://arxiv.org/abs/2403.16635v1

Compressor summary: The paper introduces a new message unit, the point cluster, for enhanced vehicle-to-everything perception, overcoming the bandwidth and dense-representation issues of existing methods.


A comparative analysis of embedding models for patent similarity

http://arxiv.org/abs/2403.16630v1

Compressor summary: The paper evaluates patent-specific pretrained embedding models and Sentence Transformers for text-based patent similarity, using patent interferences as ground-truth, and proposes Patent SBERT-adapt-ub, which outperforms current methods.


SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

http://arxiv.org/abs/2403.16627v1

Compressor summary: Key points:
- Diffusion models are powerful but have drawbacks like complex architecture and high latency
- The paper introduces a dual approach to reduce model latency: miniaturization and fewer sampling steps
- The method uses knowledge distillation, feature matching, and score distillation
- Two new models, SDXS-512 and SDXS-1024, are faster than previous ones on a single GPU

Summary: The paper proposes a dual approach to speed up diffusion models by miniaturizing the architecture and reducing sampling steps, using knowledge distillation and other techniques. The new models are significantly faster than previous ones on a single GPU.


Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts

http://arxiv.org/abs/2403.16614v1

Compressor summary: The authors propose multi-lingual sentence encoders for crisis-related social media texts in over 50 languages, improving semantic similarity and contextual understanding.


Calibrating Bayesian UNet++ for Sub-Seasonal Forecasting

http://arxiv.org/abs/2403.16612v1

Compressor summary: The paper proposes a method for calibrating neural network-based seasonal temperature forecasts, improving their reliability and sharpness.


Conversational Grounding: Annotation and Analysis of Grounding Acts and Grounding Units

http://arxiv.org/abs/2403.16609v1

Compressor summary: The text discusses the importance of conversational grounding, which helps build trustworthy dialog systems by ensuring shared understanding between parties, and introduces a framework for annotating dialog corpora with grounding acts and units to improve this capability.


Enhancing Industrial Transfer Learning with Style Filter: Cost Reduction and Defect-Focus

http://arxiv.org/abs/2403.16607v1

Compressor summary: Style Filter is a method that improves transfer learning performance for industrial data by selectively filtering source domain data without using labels or prior knowledge.


SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation

http://arxiv.org/abs/2403.16605v1

Compressor summary: This paper proposes a method to generate high-quality semantic segmentation masks for satellite images using generative image diffusion, which improves performance in earth observation tasks.


TrustAI at SemEval-2024 Task 8: A Comprehensive Analysis of Multi-domain Machine Generated Text Detection Techniques

http://arxiv.org/abs/2403.16592v1

Compressor summary: Key points:
- The paper presents methods to detect machine-generated text across various domains and languages
- It analyzes different approaches, such as statistical, neural, and pre-trained models
- It reports accuracy results on two subtasks
- It discusses challenges and factors for future research

Summary: The paper proposes and evaluates methods to identify machine-generated text using various approaches and languages, achieving high accuracy and highlighting future directions.


Deciphering the Interplay between Local Differential Privacy, Average Bayesian Privacy, and Maximum Bayesian Privacy

http://arxiv.org/abs/2403.16591v1

Compressor summary: This paper compares local differential privacy (LDP) and Bayesian privacy methods for measuring privacy in machine learning and proposes a framework to better understand their trade-offs and effectiveness.


Can Large Language Models (or Humans) Distill Text?

http://arxiv.org/abs/2403.16584v1

Compressor summary: Large language models can partially remove unwanted information from text, but struggle to completely eliminate sentiment without losing semantic content.


In the Search for Optimal Multi-view Learning Models for Crop Classification with Global Remote Sensing Data

http://arxiv.org/abs/2403.16582v1

Compressor summary: The paper investigates how to choose the best encoder and fusion strategy for multi-view learning in crop classification using various temporal data sources, suggesting a framework for researchers.


SegICL: A Universal In-context Learning Framework for Enhanced Segmentation in Medical Imaging

http://arxiv.org/abs/2403.16578v1

Compressor summary: SegICL is a novel approach that uses in-context learning for text-guided image segmentation, enabling adaptation to new tasks without training or fine-tuning the model.


NSINA: A News Corpus for Sinhala

http://arxiv.org/abs/2403.16571v1

Compressor summary: The study introduces NSINA, a large Sinhala news corpus and NLP tasks, to address challenges in adapting LLMs to low-resource languages like Sinhala.


Elysium: Exploring Object-level Perception in Videos via MLLM

http://arxiv.org/abs/2403.16558v1

Compressor summary: Elysium is a model that uses a large dataset of video frames with object boxes and descriptions to perform object tracking and expression generation tasks in videos, overcoming challenges related to pretraining and computational cost.


PE: A Poincare Explanation Method for Fast Text Hierarchy Generation

http://arxiv.org/abs/2403.16554v1

Compressor summary: The paper introduces Poincaré Explanation, a method for explaining deep learning models in NLP using hyperbolic spaces, which capture syntax and semantics better than Euclidean spaces.


Efficient Information Extraction in Few-Shot Relation Classification through Contrastive Representation Learning

http://arxiv.org/abs/2403.16543v1

Compressor summary: The paper proposes a novel method for few-shot relation classification using multiple sentence representations and contrastive learning to extract complementary information.


DOrA: 3D Visual Grounding with Order-Aware Referring

http://arxiv.org/abs/2403.16539v1

Compressor summary: DOrA is a new 3D visual grounding framework that uses Large Language Models to suggest an order of anchor objects, improving the accuracy of locating target objects in scenes described by natural language.


VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

http://arxiv.org/abs/2403.16536v1

Compressor summary: The paper introduces VMRNN cells, a new recurrent unit that combines Vision Mamba blocks and LSTM for spatiotemporal forecasting tasks.


An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

http://arxiv.org/abs/2403.16530v1

Compressor summary: Intermediate fusion improves text-to-image alignment and efficiency in diffusion models by fusing conditioning text in a specially designed space instead of using early fusion in pretrained image features.


Open-Set Recognition in the Age of Vision-Language Models

http://arxiv.org/abs/2403.16528v1

Compressor summary: Vision-language models are not open-set models despite being trained on large datasets because their finite query sets introduce closed-set assumptions, leading to low precision and recall in open-set conditions.


Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art

http://arxiv.org/abs/2403.16527v1

Compressor summary: Autonomous systems use modular sub-components or foundation models for decision-making, but need better ways to detect and mitigate hallucinations that can lead to poor performance in out-of-distribution scenarios.


ModeTv2: GPU-accelerated Motion Decomposition Transformer for Pairwise Optimization in Medical Image Registration

http://arxiv.org/abs/2403.16526v1

Compressor summary: The study presents a pyramid network with ModeTv2 operator that optimizes deformable image registration in medical imaging, improving accuracy, efficiency, and generalizability while being fast and interpretable.


Harnessing the power of LLMs for normative reasoning in MASs

http://arxiv.org/abs/2403.16524v1

Compressor summary: This paper explores using large language models to create socially aware agents that can discover, reason, and make decisions based on norms, and discusses challenges and opportunities for collaboration across fields.


CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification

http://arxiv.org/abs/2403.16520v1

Compressor summary: The paper introduces CMViM, a new method to learn efficient and unified representations from multi-modal 3D medical images for Alzheimer's disease (AD) diagnosis, using a masked Vim autoencoder and contrastive learning.


Visually Guided Generative Text-Layout Pre-training for Document Intelligence

http://arxiv.org/abs/2403.16516v1

Compressor summary: ViTLP is a pre-training method for visual document understanding that generates text and layout sequences from document images and performs well on various downstream tasks.


Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models

http://arxiv.org/abs/2403.16513v1

Compressor summary: Key points:
- Generative models can create realistic images, but have inconsistent artifacts
- Proposed method uses natural traces to distinguish real from fake images
- Method outperforms baselines and shows high accuracy on a real-world platform

Summary: The paper introduces a method that uses natural traces learned from real images to detect artifacts in generative models' fake images, achieving better performance than previous methods.


LLMs Are Few-Shot In-Context Low-Resource Language Learners

http://arxiv.org/abs/2403.16512v1

Compressor summary: The study explores in-context learning (ICL) for low-resource languages, introduces query alignment to improve performance, and provides insights into its effectiveness and challenges.


Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

http://arxiv.org/abs/2403.16510v1

Compressor summary: The Make-Your-Anchor system generates anchor-style videos from one-minute clips with precise torso and hand movements, using a structure-guided diffusion model and a face enhancement module.


Human Understanding AI Paper Challenge 2024 -- Dataset Design

http://arxiv.org/abs/2403.16509v1

Compressor summary: The document outlines the datasets and challenges for the 2024 Human Understanding AI Paper Challenge, which focuses on developing AI technologies for understanding human daily life.


Return to Tradition: Learning Reliable Heuristics with Classical Machine Learning

http://arxiv.org/abs/2403.16508v1

Compressor summary: WL-GOOSE uses graph representations and WL features to learn heuristics faster and better than existing deep learning models, achieving competitive performance in several domains.


LARA: Linguistic-Adaptive Retrieval-Augmented LLMs for Multi-Turn Intent Classification

http://arxiv.org/abs/2403.16504v1

Compressor summary: LARA is a language model that improves multi-turn intent classification in chatbots across six languages by combining fine-tuning with adaptive retrieval techniques.


Medical Image Registration and Its Application in Retinal Images: A Review

http://arxiv.org/abs/2403.16502v1

Compressor summary: The text provides a comprehensive review of medical image registration methods from traditional and deep learning-based directions, focusing on recent advances in retinal image registration and its challenges.


Learning To Guide Human Decision Makers With Vision-Language Models

http://arxiv.org/abs/2403.16501v1

Compressor summary: The paper proposes a framework for assisting human decision making in high-stakes tasks using interpretable and task-specific textual guidance from AI models instead of taking control away from the expert.


Self-Supervised Learning for Medical Image Data with Anatomy-Oriented Imaging Planes

http://arxiv.org/abs/2403.16499v1

Compressor summary: The text proposes two self-supervised learning tasks using anatomy-oriented imaging planes to improve pretraining for medical image analysis.


PathoTune: Adapting Visual Foundation Model to Pathological Specialists

http://arxiv.org/abs/2403.16497v1

Compressor summary: PathoTune is a framework that adapts foundation models to pathology-specific tasks using multi-modal prompt tuning, improving performance over single-modality approaches and even outperforming specialized pathological models.


LSTTN: A Long-Short Term Transformer-based Spatio-temporal Neural Network for Traffic Flow Forecasting

http://arxiv.org/abs/2403.16495v1

Compressor summary: Key points:
- LSTTN is a novel framework for long- and short-term traffic flow prediction using STGNNs
- LSTTN leverages masked subseries Transformer to learn compressed and contextual temporal representations from long historical series
- LSTTN extracts long-term trend, periodic features and short-term features by different convolution layers and fuses them for prediction
- LSTTN outperforms baseline models on four real-world datasets in 60-minute-ahead forecasting

Summary: LSTTN is a new traffic flow prediction model that uses STGNNs to learn from long historical series and extract long- and short-term features by various convolution layers, achieving significant improvement over baselines.


CT-Bound: Fast Boundary Estimation From Noisy Images Via Hybrid Convolution and Transformer Neural Networks

http://arxiv.org/abs/2403.16494v1

Compressor summary: CT-Bound is a fast and accurate method for estimating image boundaries using a hybrid neural network that combines Convolution and Transformer layers, enabling real-time video processing.


Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

http://arxiv.org/abs/2403.16483v1

Compressor summary: The paper proposes WHLL, a method to create a large-scale geoparsing corpus from Wikipedia articles using hyperlinks to annotate coordinates for multiple location expressions.


Determined Multi-Label Learning via Similarity-Based Prompt

http://arxiv.org/abs/2403.16482v1

Compressor summary: DMLL is a novel labeling setting for multi-label classification that reduces annotation cost by assigning determined labels to training instances, and this paper proposes a risk-consistent estimator and a similarity-based prompt learning method to learn from these determined labels.


REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices

http://arxiv.org/abs/2403.16481v1

Compressor summary: The paper proposes a novel method to synthesize new views in real-time on various scenes with rich view-dependent appearances, using meshes and a neural environment map.


Learning from Reduced Labels for Long-Tailed Data

http://arxiv.org/abs/2403.16469v1

Compressor summary: The paper introduces a new weakly supervised learning setting called Reduced Label that reduces labeling costs and preserves supervised information for long-tailed data classes, achieving better performance than existing methods.


Few-shot Named Entity Recognition via Superposition Concept Discrimination

http://arxiv.org/abs/2403.16463v1

Compressor summary: SuperCD is a method to improve few-shot named entity recognition by actively learning from superposition concepts and instances.


On the rates of convergence for learning with convolutional neural networks

http://arxiv.org/abs/2403.16459v1

Compressor summary: The paper analyzes approximation and learning abilities of CNNs using new bounds on their weights and size, and shows that they achieve minimax optimal convergence rates for various problems like regression and classification.


DeepMachining: Online Prediction of Machining Errors of Lathe Machines

http://arxiv.org/abs/2403.16451v1

Compressor summary: DeepMachining is a deep learning system for online prediction of lathe machine errors using manufacturing data and pre-trained models.


Camera-aware Label Refinement for Unsupervised Person Re-identification

http://arxiv.org/abs/2403.16450v1

Compressor summary: The paper proposes a camera-aware label refinement framework to improve unsupervised person re-identification by reducing feature distribution discrepancies across different cameras.


A Study on How Attention Scores in the BERT Model are Aware of Lexical Categories in Syntactic and Semantic Tasks on the GLUE Benchmark

http://arxiv.org/abs/2403.16447v1

Compressor summary: The study explores how attention scores in BERT models vary by lexical category during fine-tuning for different tasks and finds that content words get more attention in semantic tasks, while function words get more attention in syntactic tasks.


Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm

http://arxiv.org/abs/2403.16446v1

Compressor summary: The text proposes an automatic evaluation method for large language models (LLMs) in medical diagnosis using a multi-agent framework, standardized patients, and a Retrieval-Augmented Evaluation (RAE).


KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

http://arxiv.org/abs/2403.16444v1

Compressor summary: Key points:
- Instruction Tuning is essential for large language models to perform well on specific tasks
- Publicly available instruction datasets in English are widely used, but not in Korean
- KIT-19 is a new instruction dataset for Korean NLP tasks, composed of 19 existing open-source datasets
- KIT-19 helps train a Korean Pretrained LLM that significantly outperforms existing ones

Summary: KIT-19 is a novel instruction dataset for Korean NLP that enables training a superior Korean Pretrained LLM.


CodeS: Natural Language to Code Repository via Multi-Layer Sketch

http://arxiv.org/abs/2403.16443v1

Compressor summary: The paper proposes CodeS, a framework that generates code repositories from natural language requirements using LLMs, and evaluates its performance using both automated and manual methods.


If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

http://arxiv.org/abs/2403.16442v1

Compressor summary: The paper proposes EX2, a method to analyze visual language models and find out what features they use to represent concepts, revealing the importance of non-visual attributes and spurious descriptions in VLM representations.


RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

http://arxiv.org/abs/2403.16440v1

Compressor summary: The paper proposes RCBEVDet, a method for fusing radar and camera data to improve 3D object detection in autonomous driving.


InstUPR : Instruction-based Unsupervised Passage Reranking with Large Language Models

http://arxiv.org/abs/2403.16435v1

Compressor summary: The paper presents InstUPR, a passage reranking method using instruction-tuned LLMs without extra fine-tuning, which performs better than unsupervised baselines and an instruction-tuned reranker.


$\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models

http://arxiv.org/abs/2403.16432v1

Compressor summary: Prompt-based learning on PLMs can be adversarially attacked with natural UATs generated by the $\textit{LinkPrompt}$ algorithm, affecting the performance of downstream NLP tasks.


DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

http://arxiv.org/abs/2403.16431v1

Compressor summary: The paper introduces DOCTR, a novel method for point scene understanding that leverages object-centric representation and Transformer decoder to optimize queries involving object relationships, achieving state-of-the-art performance on ScanNet dataset.


Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

http://arxiv.org/abs/2403.16428v1

Compressor summary: The HANDS23 challenge aims to improve 3D hand-object reconstruction from egocentric views by addressing challenges like occlusion, viewpoint bias, distortion, and motion blur.


Re2LLM: Reflective Reinforcement Large Language Model for Session-based Recommendation

http://arxiv.org/abs/2403.16427v1

Compressor summary: Key points:
- The paper proposes a new method (Re2LLM) for session-based recommendation using large language models
- Re2LLM guides LLMs to learn from their own errors and a knowledge base of hints
- Re2LLM achieves better recommendations than existing methods

Summary: The paper introduces Re2LLM, a method that enhances session-based recommendation with large language models by making them reflect on their mistakes and use hints from a knowledge base.


An Experiment with the Use of ChatGPT for LCSH Subject Assignment on Electronic Theses and Dissertations

http://arxiv.org/abs/2403.16424v1

Compressor summary: The study explores using ChatGPT to generate Library of Congress Subject Headings (LCSH) for electronic theses and dissertations (ETDs), finding that while LLMs can help with cataloging backlog, human catalogers are still needed for validity.


Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation

http://arxiv.org/abs/2403.16422v1

Compressor summary: The paper proposes a new framework for improving Text-to-Image generation models that can handle lengthy and complex visual text, demonstrating significant improvements on two benchmarks.


An incremental MaxSAT-based model to learn balanced rules

http://arxiv.org/abs/2403.16418v1

Compressor summary: This paper proposes IMLIB, an interpretable machine learning model based on MaxSAT that generates balanced and accurate classification rules.


How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation

http://arxiv.org/abs/2403.16416v1

Compressor summary: The text discusses the limitations and challenges of using large language models for user simulators in conversational recommender systems, proposing a new method called SimpleUserSim to improve recommendations.


Unsupervised Template-assisted Point Cloud Shape Correspondence Network

http://arxiv.org/abs/2403.16412v1

Compressor summary: TANet is a novel unsupervised method for finding point-wise correspondences between point clouds of deformable and unusual shapes.


Spike-NeRF: Neural Radiance Field Based On Spike Camera

http://arxiv.org/abs/2403.16410v1

Compressor summary: Spike-NeRF is a novel method for 3D reconstruction and viewpoint synthesis of high-speed scenes using continuous spike streams from moving spike cameras.


A Survey on Long Video Generation: Challenges, Methods, and Prospects

http://arxiv.org/abs/2403.16407v1

Compressor summary: This paper surveys recent advancements and paradigms in generating long-duration videos, discussing network design, conditioning techniques, datasets, evaluation metrics, and future directions.


ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation

http://arxiv.org/abs/2403.16400v1

Compressor summary: ASDF is an approach for in-situ AR visualization in assembly scenarios that combines object detection, 6D pose estimation, and assembly state detection to provide guidance and reduce errors.


Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases

http://arxiv.org/abs/2403.16396v1

Compressor summary: Definition bias affects information extraction models, and a multi-stage framework is proposed to measure and mitigate it.


Multi-attention Associate Prediction Network for Visual Tracking

http://arxiv.org/abs/2403.16395v1

Compressor summary: The paper proposes a new network (MAPNet) with improved matchers for feature comparison in classification-regression tasks, which leads to better tracking performance on several benchmarks.


Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

http://arxiv.org/abs/2403.16394v1

Compressor summary: The paper proposes metrics to measure linguistic and visual skew in text-to-image generation datasets, showing that balanced phenomenological coverage improves generalization without increasing data size.


Concurrent Linguistic Error Detection (CLED) for Large Language Models

http://arxiv.org/abs/2403.16393v1

Compressor summary: The paper proposes CLED, a method to detect errors in LLMs' outputs by extracting linguistic features and using a concurrent classifier, without accessing the model's internal nodes.


Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

http://arxiv.org/abs/2403.16387v1

Compressor summary: Text-IF is a novel approach for image fusion that uses text guidance to address degradations and interactive needs, achieving better performance and flexibility than existing methods.


Dia-LLaMA: Towards Large Language Model-driven CT Report Generation

http://arxiv.org/abs/2403.16386v1

Compressor summary: Dia-LLaMA is a framework that adapts LLaMA2-7B for CT report generation by incorporating diagnostic information as guidance prompts, using a pre-trained ViT3D and disease prototype memory bank, and introducing disease-aware attention.


Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

http://arxiv.org/abs/2403.16385v1

Compressor summary: The paper presents a method to improve chart VQA models by leveraging LLMs as an automatic data annotator that generates question-answer annotations for chart images using a step-by-step generation procedure, achieving state-of-the-art accuracy on complex reasoning questions.


FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models

http://arxiv.org/abs/2403.16379v1

Compressor summary: The text describes the development of an efficient algorithm, FlashEval, for selecting a representative subset of data to evaluate text-to-image generative models quickly.


Real-time Adaptation for Condition Monitoring Signal Prediction using Label-aware Neural Processes

http://arxiv.org/abs/2403.16377v1

Compressor summary: The paper proposes a neural process-based approach that adapts to real-time condition monitoring signals, enables on-the-spot predictions, and incorporates qualitative information from individual units.


Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion

http://arxiv.org/abs/2403.16376v1

Compressor summary: Elite360D is a novel framework that uses an equirectangular projection (ERP) image and an icosahedron projection (ICOSAP) point set to estimate depth with better local-with-global representation and lower computational cost than previous methods.


ProIn: Learning to Predict Trajectory Based on Progressive Interactions for Autonomous Driving

http://arxiv.org/abs/2403.16374v1

Compressor summary: The paper proposes a progressive interaction network for autonomous driving that uses graph convolutions to better learn map constraints and social interactions, and a weight allocation mechanism for multi-modal training.


GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation

http://arxiv.org/abs/2403.16370v1

Compressor summary: GoodSAM framework uses a teacher assistant and Segment Anything Model to transfer knowledge for panoramic semantic segmentation without labeled data.


Learning Action-based Representations Using Invariance

http://arxiv.org/abs/2403.16369v1

Compressor summary: Key points:
- The text introduces action-bisimulation encoding, a method to capture multi-step controllability in reinforcement learning with high-dimensional observations.
- Action-bisimulation is inspired by bisimulation invariance pseudometric and extends single-step controllability with recursive invariance constraint.
- The text shows that action-bisimulation pretraining improves sample efficiency and provides theoretical and qualitative analysis of its performance.

Summary: The text presents action-bisimulation encoding, a novel method for reinforcement learning agents to learn multi-step controllability from high-dimensional observations using a recursive invariance constraint, which enhances sample efficiency and control relevance.


Distilling Semantic Priors from SAM to Efficient Image Restoration Models

http://arxiv.org/abs/2403.16368v1

Compressor summary: The proposed framework fuses and distills semantic priors from a segment anything model to improve image restoration performance without sacrificing inference efficiency.


Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion

http://arxiv.org/abs/2403.16365v1

Compressor summary: The text describes a new technique, Guided Diffusion Poisoning (GDP), that creates base samples for crafting more potent poisons and backdoors in neural networks trained on web-scraped data.


ChebMixer: Efficient Graph Representation Learning with MLP Mixer

http://arxiv.org/abs/2403.16358v1

Compressor summary: ChebMixer is a novel graph neural network architecture that uses Chebyshev polynomials for efficient and multiscale node representation learning, improving performance on various graph mining tasks.


Enhanced Facet Generation with LLM Editing

http://arxiv.org/abs/2403.16345v1

Compressor summary: The paper proposes two strategies for identifying query facets without using a search engine, which can improve performance in applications with private documents and constantly updated search engines.


Impact of Video Compression Artifacts on Fisheye Camera Visual Perception Tasks

http://arxiv.org/abs/2403.16338v1

Compressor summary: This paper analyzes how lossy video compression affects fisheye camera images used in autonomous driving systems and proposes a new metric and method to improve compression efficiency.


Graphs Generalization under Distribution Shifts

http://arxiv.org/abs/2403.16334v1

Compressor summary: GLIDER is a novel framework for graph-structured data that tackles the challenges of out-of-distribution generalization by diversifying variations across domains and minimizing representation space discrepancy, leading to improved performance in predicting semantic labels.