arxiv compressed, 2024-05-23

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-05-23, generated by the compressor, my personal LLM-based project.


Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

http://arxiv.org/abs/2405.12981v1

Compressor summary: CLA reduces KV cache size by 2x while maintaining accuracy in transformer-based autoregressive LLMs.


OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

http://arxiv.org/abs/2405.12979v1

Compressor summary: OmniGlue is a learnable image matcher that uses a vision foundation model to improve generalization and a novel keypoint position-guided attention mechanism for better matching descriptors, achieving significant gains on various image domains.


Personalized Residuals for Concept-Driven Text-to-Image Generation

http://arxiv.org/abs/2405.12978v1

Compressor summary: The method efficiently personalizes text-to-image diffusion models by learning residuals for specific concepts and applying localized attention-guided sampling to combine concept identity and generative prior.


BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once

http://arxiv.org/abs/2405.12971v1

Compressor summary: BiomedParse is a foundation model that uses textual descriptions to perform joint segmentation, detection, and recognition of 82 object types across 9 imaging modalities in biomedical image analysis.


Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

http://arxiv.org/abs/2405.12970v1

Compressor summary: Face-Adapter is an efficient adapter for high-precision and high-fidelity face editing using pre-trained diffusion models, decoupling target structure, ID, and attribute control.


Can We Treat Noisy Labels as Accurate?

http://arxiv.org/abs/2405.12969v1

Compressor summary: EchoAlign is a new approach that aligns instance features with noisy labels using controllable generative models and selective sampling to improve machine learning accuracy.


Energy Rank Alignment: Using Preference Optimization to Search Chemical Space at Scale

http://arxiv.org/abs/2405.12961v1

Compressor summary: ERA is a scalable algorithm that optimizes autoregressive policies using a gradient-based objective with an explicit reward function, achieving robust molecular search and alignment in chemical space.


Online Learning of Halfspaces with Massart Noise

http://arxiv.org/abs/2405.12958v1

Compressor summary: The paper proposes an efficient online learning algorithm for linear classifiers in the presence of Massart noise and extends it to a contextual bandit setting with consistent rewards.


Truncated Variance Reduced Value Iteration

http://arxiv.org/abs/2405.12952v1

Compressor summary: The paper presents faster randomized algorithms for finding an $\epsilon$-optimal policy in discounted Markov decision processes with known or estimated transition matrices, improving previous state-of-the-art results.


AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

http://arxiv.org/abs/2405.12944v1

Compressor summary: The paper introduces AMFD, a framework that uses Modal Extraction Alignment to train student networks for multispectral pedestrian detection without increasing inference time, and presents the SMOD dataset for this task.


Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

http://arxiv.org/abs/2405.12939v1

Compressor summary: AoR is a new framework that improves the reasoning capabilities of large language models by selecting answers based on the evaluation of reasoning chains and adjusting the number of reasoning chains dynamically.


Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in LLMs

http://arxiv.org/abs/2405.12933v1

Compressor summary: The Skin-in-the-Game framework enhances moral reasoning in large language models by simulating accountability for actions and considering multiple stakeholder perspectives.


Pytorch-Wildlife: A Collaborative Deep Learning Framework for Conservation

http://arxiv.org/abs/2405.12930v1

Compressor summary: Pytorch-Wildlife is an open-source platform that simplifies AI model development for wildlife monitoring using PyTorch, overcoming technical and interdisciplinary barriers.


Code-mixed Sentiment and Hate-speech Prediction

http://arxiv.org/abs/2405.12929v1

Compressor summary: The study evaluates how large language models perform in code-mixed settings for sentiment analysis and offensive language detection tasks, finding that bilingual and multilingual specialized models are the most successful.


On Image Registration and Subpixel Estimation

http://arxiv.org/abs/2405.12927v1

Compressor summary: The image registration problem in machine vision involves aligning images to subpixel accuracy using methods that depend on function complexity, pixel size, and sampling count.


Trusting Fair Data: Leveraging Quality in Fairness-Driven Data Removal Techniques

http://arxiv.org/abs/2405.12926v1

Compressor summary: The paper proposes a multi-objective optimization method to balance fairness and data quality in bias mitigation techniques, allowing users to choose the best subset for their application.


G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation

http://arxiv.org/abs/2405.12915v1

Compressor summary: The paper proposes a gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation by clustering on gradients and resampling.


An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

http://arxiv.org/abs/2405.12914v1

Compressor summary: The paper proposes a three-stage training pipeline to integrate Large Language Models (LLMs) into text-to-image generation, improving language understanding and image quality.


Topic Modelling Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment

http://arxiv.org/abs/2405.12910v1

Compressor summary: The paper presents a novel taxonomy for topic modelling summary judgment cases in the UK using AI, revealing patterns in their application across legal domains and demonstrating the potential of combining traditional and AI-driven approaches in legal classification.


Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents

http://arxiv.org/abs/2405.12900v1

Compressor summary: The study introduces a new training algorithm (ADPO) for open-domain dialogue systems that reduces toxicity by generating preferred and unsafe responses using a toxic control token and improves performance and stability compared to traditional methods.


DARK: Denoising, Amplification, Restoration Kit

http://arxiv.org/abs/2405.12891v1

Compressor summary: The paper presents a fast and effective image enhancement method for low-light conditions using machine learning and CNNs, which improves clarity and color accuracy without over-enhancement or unnatural colors.


Keep the Momentum: Conservation Laws beyond Euclidean Gradient Flows

http://arxiv.org/abs/2405.12888v1

Compressor summary: The paper explores conservation laws in non-Euclidean geometries and momentum-based dynamics for neural network training, finding temporal dependence and loss of conservation laws when transitioning from gradient flows to momentum dynamics.


Investigating Persuasion Techniques in Arabic: An Empirical Study Leveraging Large Language Models

http://arxiv.org/abs/2405.12884v1

Compressor summary: The paper studies persuasive techniques in Arabic social media using Pre-trained Language Models (PLMs) and finds that fine-tuning achieves the best results, while few-shot learning can improve the GPT model's performance.


Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images

http://arxiv.org/abs/2405.12875v1

Compressor summary: The paper proposes a probabilistic diffusion model to generate descriptive language for semantic changes in bi-temporal remote sensing images, addressing the pixel problem and enhancing terrain change localization accuracy.


Equivariant Spatio-Temporal Attentive Graph Networks to Simulate Physical Dynamics

http://arxiv.org/abs/2405.12868v1

Compressor summary: This paper proposes Equivariant Spatio-Temporal Attentive Graph Networks (ESTAG), a new model that uses a novel Equivariant Discrete Fourier Transform (EDFT) to extract periodic patterns from past frames and improve the dynamics simulation of physical systems.


Transparency Distortion Robustness for SOTA Image Segmentation Tasks

http://arxiv.org/abs/2405.12864v1

Compressor summary: The paper proposes a method to synthetically augment datasets with spatially varying distortions and evaluates its impact on semantic segmentation models' performance.


Toward Constraint Compliant Goal Formulation and Planning

http://arxiv.org/abs/2405.12862v1

Compressor summary: The study examines how different ethical frameworks affect an agent's goal formulation and planning, and the importance of metacognitive judgments in resolving ethical conflicts.


Influence of Water Droplet Contamination for Transparency Segmentation

http://arxiv.org/abs/2405.12861v1

Compressor summary: The paper presents a dataset and evaluates how different levels of water droplet contamination on transparent objects affect computer vision tasks like segmentation.


Weakly supervised alignment and registration of MR-CT for cervical cancer radiotherapy

http://arxiv.org/abs/2405.12850v1

Compressor summary: The paper proposes a new algorithm for aligning CT and MRI images to improve cervical cancer diagnosis and treatment by using weakly supervised registration.


A Survey of Deep Learning-based Radiology Report Generation Using Multimodal Data

http://arxiv.org/abs/2405.12833v1

Compressor summary: The text discusses recent advances in deep learning-based methods for generating radiology reports from multi-modal data, summarizes key techniques, and proposes a general workflow with five components.


Wav-KAN: Wavelet Kolmogorov-Arnold Networks

http://arxiv.org/abs/2405.12832v1

Compressor summary: Wav-KAN is a new neural network architecture that uses wavelet functions to improve interpretability, speed, robustness, and performance over traditional MLPs and Spl-KAN.


Large Language Models Meet NLP: A Survey

http://arxiv.org/abs/2405.12819v1

Compressor summary: This study investigates how large language models (LLMs) are used in natural language processing (NLP), their current achievements and future possibilities, and provides a taxonomy and challenges for LLMs in NLP.


FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

http://arxiv.org/abs/2405.12807v1

Compressor summary: The paper improves the Adam optimizer by correcting its flaws and refining it with insights from information geometry, leading to better performance in various domains.


MOSS: Motion-based 3D Clothed Human Synthesis from Monocular Video

http://arxiv.org/abs/2405.12806v1

Compressor summary: The MOSS framework uses kinematic information to create realistic motion-aware 3D clothed human reconstructions from single-view monocular videos.


Stochastic Inference of Plate Bending from Heterogeneous Data: Physics-informed Gaussian Processes via Kirchhoff-Love Theory

http://arxiv.org/abs/2405.12802v1

Compressor summary: The paper proposes a method to infer plate rigidity and other properties using physics-informed Gaussian Processes and Bayesian inference from noisy measurements, with potential applications in structural health monitoring and uncertainty quantification.


Comparing Neighbors Together Makes it Easy: Jointly Comparing Multiple Candidates for Efficient and Effective Retrieval

http://arxiv.org/abs/2405.12801v1

Compressor summary: The CMC framework compares a query and multiple candidates using self-attention layers, improving reranking performance while being scalable and lightweight.


DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

http://arxiv.org/abs/2405.12796v1

Compressor summary: DisenStudio is a novel framework that generates text-guided videos for multiple customized subjects using a diffusion-based model enhanced with spatial-disentangled cross-attention and motion-preserved disentangled finetuning.


Adaptive local boundary conditions to improve Deformable Image Registration

http://arxiv.org/abs/2405.12791v1

Compressor summary: The paper proposes a new method for medical image registration that adapts boundary conditions based on flow fields, improving accuracy in two registration tasks.


Anticipating Object State Changes

http://arxiv.org/abs/2405.12789v1

Compressor summary: The paper presents a novel framework that uses visual and linguistic cues to anticipate object state changes in images and videos, building on the Ego4D dataset with new annotation data (Ego4D-OSCA) and showing good performance in dynamic scenarios.


What Have We Achieved on Non-autoregressive Translation?

http://arxiv.org/abs/2405.12788v1

Compressor summary: Non-autoregressive translation methods have improved but still lag behind autoregressive ones in terms of quality and reliability.


Artificial Intelligence Approaches for Predictive Maintenance in the Steel Industry: A Survey

http://arxiv.org/abs/2405.12785v1

Compressor summary: Predictive Maintenance (PdM) using AI is crucial for improving steel industry efficiency and sustainability, but faces challenges in practical implementation and reproducibility.


Generalize Polyp Segmentation via Inpainting across Diverse Backgrounds and Pseudo-Mask Refinement

http://arxiv.org/abs/2405.12784v1

Compressor summary: The authors propose a method to insert synthetic polyps into endoscopic images using inpainting, and improve pseudo-masks with a guided refinement network and data augmentation, resulting in better polyp segmentation performance.


Self-Supervised Modality-Agnostic Pre-Training of Swin Transformers

http://arxiv.org/abs/2405.12781v1

Compressor summary: SwinFUSE uses multi-modal pre-training to enhance 3D medical imaging segmentation by learning from CT and MRI, improving adaptability and generalizability.


Transformer in Touch: A Survey

http://arxiv.org/abs/2405.12779v1

Compressor summary: The text reviews how the Transformer model, which excels at natural language processing, can be applied to improve tactile perception tasks like object recognition and manipulation.


Blind Separation of Vibration Sources using Deep Learning and Deconvolution

http://arxiv.org/abs/2405.12774v1

Compressor summary: The paper proposes a blind separation method for vibration sources from rotating machinery, enabling early detection of gear-related and bearing faults using dilated CNNs and whitening-based deconvolution.


Cross-spectral Gated-RGB Stereo Depth Estimation

http://arxiv.org/abs/2405.12759v1

Compressor summary: The authors propose a method that combines gated imaging with stereo HDR RCCB cameras to capture accurate depth information using low-cost sensors and flood-illumination, outperforming existing methods for long ranges.


BIMM: Brain Inspired Masked Modeling for Video Representation Learning

http://arxiv.org/abs/2405.12757v1

Compressor summary: BIMM is a framework that mimics the human brain's visual pathway to learn comprehensive video representations using masked modeling and partial parameter sharing.


Parallel Algorithm for Optimal Threshold Labeling of Ordinal Regression Methods

http://arxiv.org/abs/2405.12756v1

Compressor summary: Ordinal regression is a method to classify ordinal data using a one-dimensional transformation of the explanatory variable and optimal threshold labeling, which can be learned efficiently with parallel processing.


Progress Measures for Grokking on Real-world Datasets

http://arxiv.org/abs/2405.12755v1

Compressor summary: This paper investigates grokking in real-world datasets using deep neural networks and introduces three new progress measures related to generalization that better explain grokking than weight norms.


C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

http://arxiv.org/abs/2405.12752v1

Compressor summary: The paper proposes C3L, a new method for generating VLIT data using contrastive learning and a content relevance module to improve the quality and content match between image instructions and generated images.


The Echoes of Multilinguality: Tracing Cultural Value Shifts during LM Fine-tuning

http://arxiv.org/abs/2405.12744v1

Compressor summary: The study investigates how multilingual language models (MLMs) revise cultural values during fine-tuning when exposed to new linguistic experience from different data sources and languages.


Multi-Subject Personalization

http://arxiv.org/abs/2405.12742v1

Compressor summary: The paper proposes Multi-Subject Personalization (MSP) to improve the quality and coherence of text-to-image models for creative story illustrations with multiple characters or objects.


SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

http://arxiv.org/abs/2405.12739v1

Compressor summary: The paper proposes Sequential Preference Optimization (SPO), a method to fine-tune large language models to align with multiple aspects of human preferences, without using explicit reward models.


Predicting the Influence of Adverse Weather on Pedestrian Detection with Automotive Radar and Lidar Sensors

http://arxiv.org/abs/2405.12736v1

Compressor summary: The paper presents a method to predict how rain and fog affect pedestrian detection by radar and lidar sensors using empirical data, improving upon existing models.


Leveraging Neural Radiance Fields for Pose Estimation of an Unknown Space Object during Proximity Operations

http://arxiv.org/abs/2405.12728v1

Compressor summary: The paper presents a novel method for estimating the 6D pose of an unknown spacecraft relative to a monocular camera using a Neural Radiance Field trained on sparse real-world images.


RemoCap: Disentangled Representation Learning for Motion Capture

http://arxiv.org/abs/2405.12724v1

Compressor summary: RemoCap is a method that uses spatial disentanglement and motion disentanglement to accurately reconstruct 3D human bodies from realistic motion sequences despite occlusions.


StarLKNet: Star Mixup with Large Kernel Networks for Palm Vein Identification

http://arxiv.org/abs/2405.12721v1

Compressor summary: The paper proposes StarLKNet, a large kernel convolution-based network for palm-vein identification, which improves security and convenience using Mixup and achieves superior performance on two public datasets.


Reinforcement Learning Enabled Peer-to-Peer Energy Trading for Dairy Farms

http://arxiv.org/abs/2405.12716v1

Compressor summary: The MAPDES simulator shows that using renewables and P2P energy trading can save costs and reduce demand for dairy farms.


Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification

http://arxiv.org/abs/2405.12713v1

Compressor summary: DIAN is a novel network that uses dynamic identity-guided attention to mine modality-consistent embeddings for visible-infrared person re-identification, achieving state-of-the-art performance.


A Masked Semi-Supervised Learning Approach for Otago Micro Labels Recognition

http://arxiv.org/abs/2405.12711v1

Compressor summary: The study proposes a machine learning approach to recognize micro activities of the Otago Exercise Program for older adults, using a Transformer encoder and a Temporal Convolutional Network, which can help monitor exercise intensity and difficulty automatically.


Text-Video Retrieval with Global-Local Semantic Consistent Learning

http://arxiv.org/abs/2405.12710v1

Compressor summary: GLSCL is a fast and effective text-video retrieval method that uses latent shared semantics across modalities, global and local interactions, and inter-consistency and intra-diversity losses to achieve high performance and efficiency.


Multimodal video analysis for crowd anomaly detection using open access tourism cameras

http://arxiv.org/abs/2405.12708v1

Compressor summary: The article describes a method to detect crowd anomalies from video data using pattern recognition and segmentation, which can help in decision-making for sectors like tourism and security.


Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting

http://arxiv.org/abs/2405.12705v1

Compressor summary: This paper proposes a multimodal early exit model design that balances performance and efficiency for visually-rich document understanding tasks, achieving faster speeds with similar accuracy to traditional models.


OLAPH: Improving Factuality in Biomedical Long-form Question Answering

http://arxiv.org/abs/2405.12701v1

Compressor summary: The paper introduces MedLFQA, a benchmark dataset for evaluating factuality in biomedical long-form question answering, and proposes OLAPH, a framework to improve factuality using sampling predictions and preference optimization.


Explainable offline automatic signature verifier to support forensic handwriting examiners

http://arxiv.org/abs/2405.12695v1

Compressor summary: The paper presents a new signature verifier that is easy to understand and performs well on public datasets, aiming to improve its use in forensic science and law.


Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text

http://arxiv.org/abs/2405.12689v1

Compressor summary: The paper proposes a method to detect AI-paraphrased text spans in a text by scoring each sentence based on its paraphrasing degree and evaluates it on a new dataset called PASTED.


A Multimodal Learning-based Approach for Autonomous Landing of UAV

http://arxiv.org/abs/2405.12681v1

Compressor summary: The paper presents a multimodal transformer-based detector for precise autonomous UAV landing and a DQN-based decision-making model for adaptive behaviour in outdoor scenarios.


Experimental investigation of trans-scale displacement responses of wrinkle defects in fiber reinforced composite laminates

http://arxiv.org/abs/2405.12676v1

Compressor summary: The text describes wrinkle defects in industrial products, presents a meso-mechanical modeling method to assess their stiffness, and compares two non-destructive testing methods (Shearography and FPP) for measuring their displacement responses.


A Survey on Multi-modal Machine Translation: Tasks, Methods and Challenges

http://arxiv.org/abs/2405.12669v1

Compressor summary: The paper provides an extensive overview of 99 multi-modal machine translation studies, analyzing factors affecting performance and discussing future directions.


EmoEdit: Evoking Emotions through Image Manipulation

http://arxiv.org/abs/2405.12661v1

Compressor summary: EmoEdit is a novel framework that uses psychological insights to modify images with content changes as well as color and style adjustments, enhancing emotional impact while preserving image composition.


Mitigating Overconfidence in Out-of-Distribution Detection by Capturing Extreme Activations

http://arxiv.org/abs/2405.12658v1

Compressor summary: The authors propose a method to detect out-of-distribution inputs in neural networks by measuring extreme activation values in the penultimate layer, which serves as a proxy for overconfidence.


Retrieval-Augmented Language Model for Extreme Multi-Label Knowledge Graph Link Prediction

http://arxiv.org/abs/2405.12656v1

Compressor summary: The paper proposes a new task and framework to improve extrapolation in large language models using structured knowledge graphs and retrieval-augmentation, addressing hallucination and expensive training costs.


Utilizing Description Logics for Global Explanations of Heterogeneous Graph Neural Networks

http://arxiv.org/abs/2405.12654v1

Compressor summary: The paper proposes using class expressions from description logic to explain node classification in graph-structured data more expressively and compares two scoring functions to identify the best explanation among multiple candidates.


Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency

http://arxiv.org/abs/2405.12648v1

Compressor summary: CooK is a model that improves scene graph generation by incorporating co-occurrence knowledge between objects and addressing the long-tail problem in the training dataset.


PoseGravity: Pose Estimation from Points and Lines with Axis Prior

http://arxiv.org/abs/2405.12646v1

Compressor summary: The paper introduces a new efficient algorithm for estimating camera pose using intersections of a hyperbola and the unit circle, with simplified solutions for specific scenarios.


Multiscale lubrication simulation based on fourier feature networks with trainable frequency

http://arxiv.org/abs/2405.12638v1

Compressor summary: This paper presents a new neural network method for analyzing rough surface lubrication that adapts to different frequency components, improving on existing methods and offering a more accurate and efficient tool for this application.


Automating Attendance Management in Human Resources: A Design Science Approach Using Computer Vision and Facial Recognition

http://arxiv.org/abs/2405.12633v1

Compressor summary: Haar Cascade is a simple and low-cost algorithm for face detection in images and videos that can be used with OpenCV2 and NVIDIA Jetson Nano for efficient attendance tracking.


Exploration of Masked and Causal Language Modelling for Text Generation

http://arxiv.org/abs/2405.12630v1

Compressor summary: This paper compares masked language modeling (MLM) and causal language modeling (CLM) for text generation tasks and shows that MLM consistently generates better texts, but the quality of the generated texts does not strongly correlate with their performance in downstream tasks.


Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition

http://arxiv.org/abs/2405.12621v1

Compressor summary: ToM modelling is less important for effective collaboration in dialogue-based CPA tasks than previously thought, as performance improves more by predicting one's own missing knowledge.


MentalQA: An Annotated Arabic Corpus for Questions and Answers of Mental Healthcare

http://arxiv.org/abs/2405.12619v1

Compressor summary: MentalQA is an Arabic dataset with conversational Q&A on mental health, annotated using a rigorous schema, to support text mining tools for diagnosis and treatment.


Quantifying Emergence in Large Language Models

http://arxiv.org/abs/2405.12617v1

Compressor summary: The paper proposes a method to measure the "intelligent" behaviors of large language models (LLMs) by comparing entropy reduction at different levels of abstraction, and shows its effectiveness and novel insights on emergent patterns.


Learning Causal Dynamics Models in Object-Oriented Environments

http://arxiv.org/abs/2405.12615v1

Compressor summary: The paper introduces Object-Oriented CDM (OOCDM), a model that learns causal dependencies among objects in large-scale environments, and shows its advantages over existing CDMs in various aspects.


Tagengo: A Multilingual Chat Dataset

http://arxiv.org/abs/2405.12612v1

Compressor summary: The authors create a high-quality dataset of 70k prompt-response pairs in 74 languages and use it to train an open-source English LLM to chat multilingually, outperforming previous state-of-the-art models and showing that more multilingual data helps performance in the target language (Japanese).


S3O: A Dual-Phase Approach for Reconstructing Dynamic Shape and Skeleton of Articulated Objects from Single Monocular Video

http://arxiv.org/abs/2405.12607v1

Compressor summary: S3O is a new method that learns parametric models of dynamic objects from monocular videos without extra annotations, using a phased approach for efficient and robust 3D reconstruction.


Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

http://arxiv.org/abs/2405.12604v1

Compressor summary: The paper proposes a sentinel model that reduces toxicity in responses from large language models by adding a few tokens to the input prompt, using an interleaved training method with PPO to improve both red team and sentinel models.


FFAM: Feature Factorization Activation Map for Explanation of 3D Detectors

http://arxiv.org/abs/2405.12601v1

Compressor summary: The paper proposes a method, FFAM, to generate high-quality visual explanations for LiDAR-based 3D object detection models using non-negative matrix factorization and feature gradient refinement.


Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

http://arxiv.org/abs/2405.12591v1

Compressor summary: DecoQuant is a data-free technique that compresses the key-value cache of large language models by using low-bit quantization on tensor decomposition, achieving significant memory savings and maintaining high inference quality.


Mining the Explainability and Generalization: Fact Verification Based on Self-Instruction

http://arxiv.org/abs/2405.12579v1

Compressor summary: This paper proposes a self-instruction based fine-tuning method for fact-checking that balances accuracy and explainability using data augmentation and improved DPO, achieving comparable or better results than traditional methods while ensuring data security.


Online Signature Recognition: A Biologically Inspired Feature Vector Splitting Approach

http://arxiv.org/abs/2405.12556v1

Compressor summary: The study explores how different feature vector splitting strategies and combinations improve biometric signature recognition in e-security applications by considering cognitive principles.


Like Humans to Few-Shot Learning through Knowledge Permeation of Vision and Text

http://arxiv.org/abs/2405.12543v1

Compressor summary: The paper proposes a method called BiKop to improve few-shot learning by leveraging both textual and visual knowledge in a hierarchical representation that balances generalization and specificity.


DrHouse: An LLM-empowered Diagnostic Reasoning System through Harnessing Outcomes from Sensor Data and Expert Knowledge

http://arxiv.org/abs/2405.12541v1

Compressor summary: DrHouse is a novel LLM-based virtual doctor system that uses smart device data, updates medical databases, and evaluates diseases' likelihood for more accurate diagnoses.


Context-Enhanced Video Moment Retrieval with Large Language Models

http://arxiv.org/abs/2405.12540v1

Compressor summary: The LMR approach uses Large Language Models to improve video context representation, align visual and language modalities, and achieve state-of-the-art results in Video Moment Retrieval, especially for complex queries.


Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

http://arxiv.org/abs/2405.12538v1

Compressor summary: The text proposes a framework to improve visual content generation by incorporating various knowledge sources and iteratively refining the output to better align with user intentions.


Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering

http://arxiv.org/abs/2405.12533v1

Compressor summary: The paper introduces a new multi-task Urdu scene text dataset for text detection, recognition, and VQA tasks, addressing the challenges of diverse text layouts and orientations.


PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

http://arxiv.org/abs/2405.12532v1

Compressor summary: PyramidInfer compresses the key-value cache of large language models by retaining crucial context layer by layer, improving inference throughput and reducing memory consumption.


CustomText: Customized Textual Image Generation using Diffusion Models

http://arxiv.org/abs/2405.12531v1

Compressor summary: The paper proposes CustomText, a method that enhances image generation with precise text customization using a TextDiffuser model and a ControlNet model.


SirLLM: Streaming Infinite Retentive LLM

http://arxiv.org/abs/2405.12528v1

Compressor summary: SirLLM is a system that helps large language models process long inputs and maintain memory during infinite-length dialogues using token entropy and memory decay.


Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models

http://arxiv.org/abs/2405.12523v1

Compressor summary: The text proposes a method called Single Image Unlearning (SIU) for forgetting visual data in Multimodal Large Language Models (MLLMs) by fine-tuning on a single image, introduces new evaluation metrics, and shows its effectiveness against attacks.


Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

http://arxiv.org/abs/2405.12522v1

Compressor summary: The paper presents a method using discrete sparse autoencoders to efficiently discover interpretable circuits in large language models by training on positive and negative examples and measuring attention head overlaps.


Unleash Graph Neural Networks from Heavy Tuning

http://arxiv.org/abs/2405.12521v1

Compressor summary: GNN-Diff is a framework that generates high-performing graph neural networks by learning from checkpoints during a light coarse search, reducing computational costs and improving generalization accuracy.


MAGE: Model-Level Graph Neural Networks Explanations via Motif-based Graph Generation

http://arxiv.org/abs/2405.12519v1

Compressor summary: MAGE is a novel motif-based approach to generate interpretable explanations for molecular graphs using attention, motif decomposition, and graph generation techniques.


Rethink Predicting the Optical Flow with the Kinetics Perspective

http://arxiv.org/abs/2405.12512v1

Compressor summary: The paper proposes a new optical flow estimation method that combines apparent and kinetics information, improves efficiency, considers warping and occlusion, and uses a self-supervised loss function to achieve better performance than existing methods.


Active Object Detection with Knowledge Aggregation and Distillation from Large Models

http://arxiv.org/abs/2405.12509v1

Compressor summary: The paper proposes a method to improve active object detection by using informed priors about possible interactions and knowledge distillation.


NOVA-3D: Non-overlapped Views for 3D Anime Character Reconstruction

http://arxiv.org/abs/2405.12505v1

Compressor summary: The paper presents NOVA-3D, a novel framework that reconstructs full-body anime characters from non-overlapped front and back views using view-aware feature fusion and synthesis. It also introduces a new dataset (NOVA-Human) with multi-view images and camera parameters for 3D anime characters, and shows superior performance over baselines.


CLRKDNet: Speeding up Lane Detection with Knowledge Distillation

http://arxiv.org/abs/2405.12503v1

Compressor summary: CLRKDNet is a streamlined model that balances lane detection accuracy with real-time performance using teacher-student distillation and new distillation losses, reducing inference time by up to 60% compared to the state-of-the-art CLRNet.


EntropyStop: Unsupervised Deep Outlier Detection with Loss Entropy

http://arxiv.org/abs/2405.12502v1

Compressor summary: The text proposes a zero-label entropy metric for detecting outliers in unlabeled contaminated datasets, enabling efficient training of deep outlier detection models with robust performance.


Entropic associative memory for real world images

http://arxiv.org/abs/2405.12500v1

Compressor summary: The entropic associative memory model can effectively process complex and unconventional images of animals and vehicles, generating meaningful associations between them.


Visualizing, Rethinking, and Mining the Loss Landscape of Deep Neural Networks

http://arxiv.org/abs/2405.12493v1

Compressor summary: The paper studies the complex loss landscapes of deep neural networks and categorizes different types of 1D and 2D curves that represent perturbation directions and surfaces, also providing theoretical insights using the Hessian matrix.


Customize Your Own Paired Data via Few-shot Way

http://arxiv.org/abs/2405.12490v1

Compressor summary: The text proposes a new image editing framework that enables users to customize effects with few image pairs using directional transformations and diffusion models, improving performance across different scenarios.


3DSS-Mamba: 3D-Spectral-Spatial Mamba for Hyperspectral Image Classification

http://arxiv.org/abs/2405.12487v1

Compressor summary: The 3D-Spectral-Spatial Mamba (3DSS-Mamba) framework uses a novel scanning mechanism to model global spectral-spatial relationships for hyperspectral image classification with improved efficiency and performance.


Gaussian Control with Hierarchical Semantic Graphs in 3D Human Recovery

http://arxiv.org/abs/2405.12477v1

Compressor summary: The HUGS framework uses semantic priors and high-frequency features to improve 3D human reconstruction by capturing geometric topology and surface details.


Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding

http://arxiv.org/abs/2405.12476v1

Compressor summary: FishPhenoKey is a large dataset with detailed annotations for measuring subtle morphological phenotypes in fish, along with new evaluation and loss functions to improve keypoint detection accuracy.


GASE: Graph Attention Sampling with Edges Fusion for Solving Vehicle Routing Problems

http://arxiv.org/abs/2405.12475v1

Compressor summary: GASE is a learning-based method that uses graph attention sampling with edge fusion to improve node embedding for vehicle routing problems, achieving better performance than existing methods.


How Universal Polynomial Bases Enhance Spectral Graph Neural Networks: Heterophily, Over-smoothing, and Over-squashing

http://arxiv.org/abs/2405.12474v1

Compressor summary: The paper introduces UniFilter, a novel adaptive polynomial filter for Graph Neural Networks that accommodates varying degrees of heterophily and improves convolution and propagation.


Leveraging Diverse Data Generation for Adaptable Zero-Shot Dialogue State Tracking

http://arxiv.org/abs/2405.12468v1

Compressor summary: The authors create a large dataset of diverse dialogue state tracking data using synthetic data generation and show that it improves zero-shot DST accuracy.


A finite element-based physics-informed operator learning framework for spatiotemporal partial differential equations on arbitrary domains

http://arxiv.org/abs/2405.12465v1

Compressor summary: The text proposes a new physics-informed operator learning framework that predicts spatiotemporal dynamics using finite element method and shows its effectiveness in a thermal conduction problem with arbitrary geometry.


Boosting X-formers with Structured Matrix for Long Sequence Time Series Forecasting

http://arxiv.org/abs/2405.12462v1

Compressor summary: The article introduces a new design for Transformer-based models that improves efficiency and accuracy for long sequence time series forecasting by replacing some components with Surrogate Attention Blocks and Surrogate FFN Blocks.


WorldAfford: Affordance Grounding based on Natural Language Instructions

http://arxiv.org/abs/2405.12461v1

Compressor summary: WorldAfford is a framework that uses natural language instructions to locate affordance regions of multiple objects in complex scenes, overcoming limitations of previous approaches.


Physics-based Scene Layout Generation from Human Motion

http://arxiv.org/abs/2405.12460v1

Compressor summary: The text describes a physics-based approach to automatically generate realistic scene layouts for 3D animation that considers physical constraints and interaction affordances using reinforcement learning.


PLM4Traj: Cognizing Movement Patterns and Travel Purposes from Trajectories with Pre-trained Language Models

http://arxiv.org/abs/2405.12459v1

Compressor summary: The proposed PLM4Traj model uses pre-trained language models to effectively learn trajectory features and adapt to different spatio-temporal data mining tasks.


Prompt-Enhanced Spatio-Temporal Graph Transfer Learning

http://arxiv.org/abs/2405.12452v1

Compressor summary: Spatio-Temporal Graph Prompting (STGP) is a framework for adapting spatio-temporal graph neural networks to diverse tasks and domains using a unified template and learnable prompts.


EPL: Empirical Prototype Learning for Deep Face Recognition

http://arxiv.org/abs/2405.12447v1

Compressor summary: The paper proposes a method to improve face recognition by adaptively updating prototypes using sample feature similarity and adjustable margins, which helps counter the effect of hard samples on the model performance.


FFCL: Forward-Forward Net with Cortical Loops, Training and Inference on Edge Without Backpropagation

http://arxiv.org/abs/2405.12443v1

Compressor summary: FFCL improves backpropagation-free Forward-Forward training of neural networks by optimizing label processing, revising label integration, and introducing feedback loops that enhance learning performance and efficiency.


No-Regret M${}^{\natural}$-Concave Function Maximization: Stochastic Bandit Algorithms and NP-Hardness of Adversarial Full-Information Setting

http://arxiv.org/abs/2405.12439v1

Compressor summary: The paper studies interactive maximization of M${}^{\natural}$-concave functions, giving no-regret algorithms for the stochastic bandit setting and proving NP-hardness of the adversarial full-information setting.


Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference

http://arxiv.org/abs/2405.12434v1

Compressor summary: ScenaFuse is a novel adapter for natural language inference that combines linguistic knowledge with visual information to enhance understanding and inference in ambiguous language.


LLM+Reasoning+Planning for supporting incomplete user queries in presence of APIs

http://arxiv.org/abs/2405.12433v1

Compressor summary: The proposed approach combines logical reasoning, classical AI planning, and an LLM to answer user queries accurately and handle missing information in API orchestration tasks.


Deep learning approaches to indoor wireless channel estimation for low-power communication

http://arxiv.org/abs/2405.12427v1

Compressor summary: The paper explores using Deep Learning to improve wireless communication for IoT devices by estimating channels from noisy signal strength measurements.