arxiv compressed, 2024-08-01

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-08-01 generated by the compressor, my personal LLM-based project.

Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

http://arxiv.org/abs/2407.21794v1

Compressor summary: This text summarizes the evolution of out-of-distribution detection and related problems in vision language models, highlights the changes in definitions and benchmarks, and discusses future challenges and directions.

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

http://arxiv.org/abs/2407.21792v1

Compressor summary: The paper analyzes AI safety benchmarks and their relationship with general capabilities, suggesting that many benchmarks may be misleadingly correlated with capability improvements, and proposing a clearer framework for AI safety research.

Vision-Language Model Based Handwriting Verification

http://arxiv.org/abs/2407.21788v1

Compressor summary: The paper explores using vision language models to improve handwriting verification by providing clear explanations and adapting to diverse styles, but finds that CNN-based ResNet-18 performs better.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

http://arxiv.org/abs/2407.21787v1

Compressor summary: Repeated inference sampling improves language model performance on various tasks by increasing coverage and cost-effectiveness.
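
As a rough illustration of why repeated sampling raises coverage (the chance that at least one sample solves a problem), the sketch below assumes every sample succeeds independently with the same probability `p`. That independence assumption and the function name `coverage` are illustrative choices, not taken from the paper.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k samples succeeds.

    Assumes (hypothetically) that each sample succeeds independently
    with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # All k samples fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Coverage grows with the number of samples drawn per problem.
    for k in (1, 10, 100):
        print(k, coverage(0.05, k))
```

Under this simplified model, even a low per-sample success rate yields high coverage once enough samples are drawn, which is why a verifier that can pick out the correct sample matters.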

The Llama 3 Herd of Models

http://arxiv.org/abs/2407.21783v1

Compressor summary: Llama 3 is a multilingual language model with various capabilities that compares well to GPT-4 and can be integrated with other modalities like image, video, and speech.

Tulip Agent -- Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries

http://arxiv.org/abs/2407.21778v1

Compressor summary: Tulip agent is an autonomous AI agent that can search for tools in a large library, reducing inference costs and enabling adaptation and extension of its tool set.

RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining

http://arxiv.org/abs/2407.21773v1

Compressor summary: RainMamba is a new video deraining method that uses state space models, Hilbert scanning, and dynamic contrastive locality learning to effectively remove rain from outdoor vision systems.

ShieldGemma: Generative AI Content Moderation Based on Gemma

http://arxiv.org/abs/2407.21772v1

Compressor summary: ShieldGemma is a suite of models that use large language models to accurately predict safety risks in user input and generated output, outperforming existing models and providing a valuable resource to the research community.

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

http://arxiv.org/abs/2407.21771v1

Compressor summary: The paper proposes an algorithm to address text inertia and reduce hallucination in large vision-language models by adjusting attention weights and subtracting logits of multi-modal inputs.

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

http://arxiv.org/abs/2407.21770v1

Compressor summary: MoMa is a new architecture that improves the efficiency of mixed-modal language models by dividing expert modules into modality-specific groups and routing them to optimize pre-training.

Learning Video Context as Interleaved Multimodal Sequences

http://arxiv.org/abs/2407.21757v1

Compressor summary: The paper introduces MovieSeq, a multimodal language model that represents videos as interleaved sequences of images, plots, videos, subtitles, and other information to improve understanding and interaction with narrative videos.

HGOE: Hybrid External and Internal Graph Outlier Exposure for Graph Out-of-Distribution Detection

http://arxiv.org/abs/2407.21742v1

Compressor summary: HGOE is a model-agnostic framework that uses external and internal outliers to improve OOD detection for graph data by adaptively assigning weights to them with a boundary-aware loss function.

Contrastive Factor Analysis

http://arxiv.org/abs/2407.21740v1

Compressor summary: Contrastive Factor Analysis is a novel framework that combines contrastive learning and factor analysis to leverage their respective advantages in unsupervised representational learning.

Unifying Event-based Flow, Stereo and Depth Estimation via Feature Similarity Matching

http://arxiv.org/abs/2407.21735v1

Compressor summary: The EventMatch framework uses a single model to perform optical flow, stereo matching, and depth estimation with event cameras by comparing feature similarities across different inputs.

ParLS-PBO: A Parallel Local Search Solver for Pseudo Boolean Optimization

http://arxiv.org/abs/2407.21729v1

Compressor summary: The paper proposes an improved local search solver for Pseudo-Boolean Optimization problems that balances hard constraints and objective function scores, and develops a parallel version that shares solutions to guide the search and enhances scoring with polarity density.

Artificial Intelligence Approaches for Energy Efficiency: A Review

http://arxiv.org/abs/2407.21726v1

Compressor summary: The paper discusses AI applications for energy efficiency in smart buildings, focusing on multi-agent systems, IoT, Big Data, anomaly detection, and Intelligent Energy Management Systems classifications.

Detecting, Explaining, and Mitigating Memorization in Diffusion Models

http://arxiv.org/abs/2407.21720v1

Compressor summary: The authors present a method to detect and mitigate memorization in diffusion models, ensuring that generated images are not replications of training data and addressing legal concerns.

Assessing the State of AI Policy

http://arxiv.org/abs/2407.21717v1

Compressor summary: The text discusses the need for oversight of AI technologies by policymakers who lack technical knowledge, and provides an overview of existing guidelines and regulations at various levels.

UMMAN: Unsupervised Multi-graph Merge Adversarial Network for Disease Prediction Based on Intestinal Flora

http://arxiv.org/abs/2407.21714v1

Compressor summary: UMMAN is a novel method that uses Graph Neural Networks to predict diseases from intestinal flora by learning the complex associations among gut microbes in an unsupervised way.

Social Learning through Interactions with Other Agents: A Survey

http://arxiv.org/abs/2407.21713v1

Compressor summary: The text discusses the role of social learning in human intelligence development and its potential application in machine learning, focusing on the use of embodied agents and natural language processing techniques.

Adaptive Retrieval-Augmented Generation for Conversational Systems

http://arxiv.org/abs/2407.21712v1

Compressor summary: The paper proposes a gating model (RAGate) to determine whether external knowledge is needed for improved system responses in conversational systems, based on human judgements and conversation context.

CEAR: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature

http://arxiv.org/abs/2407.21708v1

Compressor summary: The authors propose a method to recognize chemical entities and their roles in scientific text using ontological knowledge from ChEBI and language understanding from LLMs, and create a knowledge graph (CEAR) to extend ChEBI.

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

http://arxiv.org/abs/2407.21705v1

Compressor summary: Tora is a framework that generates videos with controllable motion by integrating trajectory information into a diffusion transformer model, enabling high-quality and dynamic video generation.

Hyper-parameter tuning for text guided image editing

http://arxiv.org/abs/2407.21703v1

Compressor summary: Forgedit is a text-guided image editing method that can handle complex problems by remembering and understanding input images during finetuning, and uses a simple workflow with efficient hyper-parameter tuning for editing.

TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities

http://arxiv.org/abs/2407.21693v1

Compressor summary: The study introduces a new dataset for task-oriented dialogue systems, called TransferTOD, which simulates human-machine conversations in 30 life service scenarios and improves the performance of large language models in information gathering.

Explainable Artificial Intelligence for Quantifying Interfering and High-Risk Behaviors in Autism Spectrum Disorder in a Real-World Classroom Environment Using Privacy-Preserving Video Analysis

http://arxiv.org/abs/2407.21691v1

Compressor summary: This study uses video-based group activity recognition to develop a machine learning model that can objectively and continuously quantify behaviors in autism spectrum disorder (ASD) in real-world classroom environments, helping to track intervention effectiveness and allocate resources.

Dynamic Object Queries for Transformer-based Incremental Object Detection

http://arxiv.org/abs/2407.21687v1

Compressor summary: The paper proposes a Transformer-based method for incremental object detection that uses dynamic learnable queries to represent new and old classes and mitigates catastrophic forgetting through bipartite matching and risk-balanced calibration.

Expressive Whole-Body 3D Gaussian Avatar

http://arxiv.org/abs/2407.21686v1

Compressor summary: ExAvatar is a 3D human avatar that learns from monocular video and supports expressive whole-body movements, addressing challenges like limited diversity and absent 3D observations with a hybrid representation of mesh and 3D Gaussians.

Synthetic Simplicity: Unveiling Bias in Medical Data Augmentation

http://arxiv.org/abs/2407.21674v1

Compressor summary: Synthetic data can introduce simplicity bias in neural networks when there is a strong correlation between the data source and the task label, leading to poor deployment performance.

Universal Approximation Theory: Foundations for Parallelism in Neural Networks

http://arxiv.org/abs/2407.21670v1

Compressor summary: This paper presents a parallelization strategy for deep learning models based on the Universal Approximation Theorem to reduce training and inference times as more layers are added.

Synth-Empathy: Towards High-Quality Synthetic Empathy Data

http://arxiv.org/abs/2407.21669v1

Compressor summary: The paper introduces Synth-Empathy, a system that uses large language models to generate and select high-quality empathetic data, improving empathetic response performance and achieving state-of-the-art results on benchmarks and human evaluations.

An Explainable Vision Transformer with Transfer Learning Combined with Support Vector Machine Based Efficient Drought Stress Identification

http://arxiv.org/abs/2407.21666v1

Compressor summary: The text describes an explainable deep learning pipeline using vision transformers that detects drought stress in potato crops from aerial images, achieving high accuracy and providing insights into the plant features associated with stress.

Defending Jailbreak Attack in VLMs via Cross-modality Information Detector

http://arxiv.org/abs/2407.21659v1

Compressor summary: The text introduces CIDER, a cross-modality information detector that uses image and text similarity to detect jailbreaking attacks on vision language models.

Comgra: A Tool for Analyzing and Debugging Neural Networks

http://arxiv.org/abs/2407.21656v1

Compressor summary: Comgra is a PyTorch library that helps inspect neural networks by visualizing their internal activations and gradients in a GUI.

Spatial Transformer Network YOLO Model for Agricultural Object Detection

http://arxiv.org/abs/2407.21652v1

Compressor summary: The paper proposes a method to improve YOLO's performance in object detection by integrating spatial transformer networks, which focus on important image areas and enhance spatial invariance.

Human interaction classifier for LLM based chatbot

http://arxiv.org/abs/2407.21647v1

Compressor summary: The study finds that using an SVM model with Cohere embeddings is the best way to classify human interactions in AIDA, a chatbot system, based on speed and accuracy.

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

http://arxiv.org/abs/2407.21646v1

Compressor summary: CLASI is a human-like simultaneous speech translation system that uses a data-driven read-write strategy and multi-modal retrieval to convey information accurately and efficiently in various languages and scenarios.

Lyapunov weights to convey the meaning of time in physics-informed neural networks

http://arxiv.org/abs/2407.21642v1

Compressor summary: The paper proposes a principled way to adapt time weighting in Physics-Informed Neural Networks using Lyapunov exponents to handle different dynamics.

Quality Control for Radiology Report Generation Models via Auxiliary Auditing Components

http://arxiv.org/abs/2407.21638v1

Compressor summary: The authors propose a framework for quality control of AI-generated radiology reports by using auxiliary auditing components that assess the reliability and importance of the diagnoses.

Zero-Shot Cross-Domain Dialogue State Tracking via Dual Low-Rank Adaptation

http://arxiv.org/abs/2407.21633v1

Compressor summary: The paper proposes Dual Low-Rank Adaptation (DualLoRA), a method to improve zero-shot dialogue state tracking by enhancing prompt influence in transformer models without increasing inference latency.

RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

http://arxiv.org/abs/2407.21631v1

Compressor summary: RoadFormer+ is a model that fuses different types of data for urban scene parsing, improving efficiency and performance over the previous RoadFormer model.

TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods

http://arxiv.org/abs/2407.21630v1

Compressor summary: TAROT is a new method for hiding an author's identity in a text while maintaining its usefulness, using policy optimization over small language models.

EZSR: Event-based Zero-Shot Recognition

http://arxiv.org/abs/2407.21616v1

Compressor summary: The paper proposes a new event encoder for zero-shot object recognition using event camera data and synthetic RGB images, improving performance over previous methods that relied on RGB frame reconstructions.

MicroMIL: Graph-based Contextual Multiple Instance Learning for Patient Diagnosis Using Microscopy Images

http://arxiv.org/abs/2407.21604v1

Compressor summary: The paper introduces MicroMIL, a weakly-supervised framework that uses deep cluster embedding and Gumbel Softmax to analyze microscopy images for histopathology research, improving efficiency and accuracy over existing methods.

Measuring What Matters: Intrinsic Distance Preservation as a Robust Metric for Embedding Quality

http://arxiv.org/abs/2407.21590v1

Compressor summary: This paper introduces IDPE, a novel method for evaluating unsupervised embeddings based on preserving Mahalanobis distances between data points in original and embedded spaces, providing a more reliable and comprehensive assessment than traditional extrinsic metrics.

InScope: A New Real-world 3D Infrastructure-side Collaborative Perception Dataset for Open Traffic Scenarios

http://arxiv.org/abs/2407.21581v1

Compressor summary: Key points:
- Autonomous vehicles' perception systems can miss occluded objects due to vehicle-centric perspective
- V2X paradigm proposes infrastructure-side perception system (IPS) to complement autonomous vehicles
- InScope is a new 3D infrastructure-side collaborative perception dataset with LiDAR sensors on infrastructure side
- InScope provides benchmarks for various tasks related to occlusion challenges in V2X scenarios
- InScope improves detection and tracking of obscured, small, and distant objects
Summary: InScope is a new dataset that helps autonomous vehicles detect and track occluded objects by using LiDAR sensors on the infrastructure side. It also provides benchmarks for evaluating V2X technologies in occlusion scenarios.

Multi-Site Class-Incremental Learning with Weighted Experts in Echocardiography

http://arxiv.org/abs/2407.21577v1

Compressor summary: The paper proposes a class-incremental learning method for echocardiography view classification that combines expert networks with a score fusion model to handle data diversity and privacy concerns.

PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning

http://arxiv.org/abs/2407.21571v1

Compressor summary: PMoE is a novel method that uses an asymmetric Transformer to reduce forgetting in large language models by adding progressive experts and routing new knowledge to appropriate layers.

TRGR: Transmissive RIS-aided Gait Recognition Through Walls

http://arxiv.org/abs/2407.21566v1

Compressor summary: TRGR is a novel system that uses transmissive RIS to enhance gait recognition through walls using only magnitude measurements of RF signals.

Generative Sentiment Analysis via Latent Category Distribution and Constrained Decoding

http://arxiv.org/abs/2407.21560v1

Compressor summary: The study introduces a generative sentiment analysis model that addresses challenges in fine-grained sentiment analysis by using a latent category distribution variable, a variational autoencoder, and a trie data structure with constrained decoding.

Operator-based semantics for choice programs: is choosing losing? (full version)

http://arxiv.org/abs/2407.21556v1

Compressor summary: The paper introduces a framework to compare different semantics of choice constructs in logic programming.

Conditioned Prompt-Optimization for Continual Deepfake Detection

http://arxiv.org/abs/2407.21554v1

Compressor summary: Prompt2Guard is a novel deepfake detection method that uses vision-language models and multimodal prompts to continuously detect photorealistic fake images without relying on prompt selection accuracy or multiple forward passes.

CXSimulator: A User Behavior Simulation using LLM Embeddings for Web-Marketing Campaign Assessment

http://arxiv.org/abs/2407.21553v1

Compressor summary: The paper introduces a CX Simulator that uses large language models to predict user behavior transitions and simulate web-marketing campaign effects without costly online testing.

Black box meta-learning intrinsic rewards for sparse-reward environments

http://arxiv.org/abs/2407.21546v1

Compressor summary: This paper explores how meta-learning can enhance reinforcement learning by optimizing intrinsic rewards without using meta-gradients, and compares it to other methods in continuous control tasks with sparse rewards.

Probabilistic Scoring Lists for Interpretable Machine Learning

http://arxiv.org/abs/2407.21535v1

Compressor summary: The paper proposes probabilistic scoring lists (PSL), an extension of scoring systems that represent uncertainty with probability distributions, and a method for learning PSLs from data to improve explainability in AI decisions.

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

http://arxiv.org/abs/2407.21534v1

Compressor summary: The paper proposes a way to improve multimodal language models' visual referring ability by adjusting visual tokens based on text prompts during inference, without needing extra training.

Data Contamination Report from the 2024 CONDA Shared Task

http://arxiv.org/abs/2407.21530v1

Compressor summary: The CONDA 2024 workshop investigates data contamination in natural language processing and aims to create a shared task and database that collect contamination evidence and help researchers avoid reporting evaluation results on contaminated resources.

Skeleton-Based Action Recognition with Spatial-Structural Graph Convolution

http://arxiv.org/abs/2407.21525v1

Compressor summary: The paper introduces Spatial-Structural GCN, a new method for skeleton-based human activity recognition that leverages both the topological structure and the dynamic similarity of edge node sequences in graph convolutional networks.

Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI

http://arxiv.org/abs/2407.21523v1

Compressor summary: Key points:
- Tabular data augmentation (TDA) enhances tabular data for machine learning tasks
- TDA pipeline consists of pre-augmentation, augmentation, and post-augmentation procedures
- Generative AI is a trending approach for TDA
Summary: The text reviews the progress and prospects of TDA, which improves tabular data for ML using generative AI and other methods.

PhysFlow: Skin tone transfer for remote heart rate estimation through conditional normalizing flows

http://arxiv.org/abs/2407.21519v1

Compressor summary: PhysFlow is a method for improving remote heart rate estimation by augmenting skin diversity using conditional normalizing flows, reducing errors especially in darker skin tones.

A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging

http://arxiv.org/abs/2407.21517v1

Compressor summary: The paper proposes a simple framework to reduce computational cost in video snapshot compressive imaging using low-bit quantization and improves the performance of existing methods.

PEAR: Phrase-Based Hand-Object Interaction Anticipation

http://arxiv.org/abs/2407.21510v1

Compressor summary: PEAR is a novel model that anticipates both interaction intention and manipulation in first-person hand-object interaction, addressing uncertainties using cross-alignment and bidirectional constraints.

Root Cause Analysis Of Productivity Losses In Manufacturing Systems Utilizing Ensemble Machine Learning

http://arxiv.org/abs/2407.21503v1

Compressor summary: The study proposes a data-driven ensemble approach that analyzes productivity losses in automation systems, identifies root causes, and improves efficiency by integrating information theory and machine learning methods with stream processing.

Mitral Regurgitation Recogniton based on Unsupervised Out-of-Distribution Detection with Residual Diffusion Amplification

http://arxiv.org/abs/2407.21497v1

Compressor summary: The text proposes an unsupervised out-of-distribution detection method for mitral regurgitation diagnosis using ultrasound videos, which can improve accuracy and reduce misdiagnosis.

Generative Expressive Conversational Speech Synthesis

http://arxiv.org/abs/2407.21491v1

Compressor summary: The text introduces a new system called GPT-Talker that generates natural and expressive conversational speech for user-agent interactions using multimodal information, and proposes a large dataset to evaluate its performance.

Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends

http://arxiv.org/abs/2407.21489v1

Compressor summary: Maverick is a simple and efficient pipeline for coreference resolution that outperforms large generative models with up to 13 billion parameters using only 500 million parameters, achieving state-of-the-art results on the CoNLL-2012 benchmark.

Parallel Strategies for Best-First Generalized Planning

http://arxiv.org/abs/2407.21485v1

Compressor summary: The paper evaluates how to speed up generalized planning by applying parallel search techniques to a novel algorithm called Best-First Generalized Planning (BFGP).

eSPARQL: Representing and Reconciling Agnostic and Atheistic Beliefs in RDF-star Knowledge Graphs

http://arxiv.org/abs/2407.21483v1

Compressor summary: The paper proposes a four-valued logic query language called eSPARQL for operating with multiple and sometimes conflicting beliefs in epistemic RDF-star metadata.

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition

http://arxiv.org/abs/2407.21476v1

Compressor summary: The text discusses how different TTS architectures affect synthetic data generation for ASR and SLT, but finds no clear relation between TTS performance and ASR performance.

Fine-gained Zero-shot Video Sampling

http://arxiv.org/abs/2407.21475v1

Compressor summary: The paper proposes a novel algorithm that generates high-quality videos from image synthesis models without extra training or optimization, using a dependency noise model and temporal momentum attention for content consistency and animation coherence.

Deep Learning-Based Longitudinal Prediction of Childhood Myopia Progression Using Fundus Image Sequences and Baseline Refraction Data

http://arxiv.org/abs/2407.21467v1

Compressor summary: The study introduces a deep learning method that accurately predicts myopia progression and risk in children using fundus images and refraction data, enabling early interventions and reducing healthcare costs.

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

http://arxiv.org/abs/2407.21465v1

Compressor summary: The paper proposes a method called MarvelOVD to improve open-vocabulary detection by using the detector as auxiliary guidance for vision language models and addressing the noise and bias issues in pseudo-labels.

Multi-agent Assessment with QoS Enhancement for HD Map Updates in a Vehicular Network

http://arxiv.org/abs/2407.21460v1

Compressor summary: The paper evaluates a multi-agent Q-learning solution for improving network performance in VANETs without increasing computational burden or compatibility issues.

KemenkeuGPT: Leveraging a Large Language Model on Indonesia's Government Financial Data and Regulations to Enhance Decision Making

http://arxiv.org/abs/2407.21459v1

Compressor summary: This study develops KemenkeuGPT, a large language model, to help Indonesia's Ministry of Finance make decisions involving complex financial data and regulations, showing its potential as an essential tool.

StreetSurfaceVis: a dataset of crowdsourced street-level imagery with semi-automated annotations of road surface type and quality

http://arxiv.org/abs/2407.21454v1

Compressor summary: The paper introduces StreetSurfaceVis, a dataset with 9,122 street images annotated for road surface type and quality to train models for assessing road surfaces, addressing the imbalance and reducing manual annotation using various strategies.

TinyChirp: Bird Song Recognition Using TinyML Models on Low-power Wireless Acoustic Sensors

http://arxiv.org/abs/2407.21453v1

Compressor summary: The paper compares tinyML neural network architectures and compression techniques for bird song detection, focusing on the corn bunting species.

Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation

http://arxiv.org/abs/2407.21450v1

Compressor summary: The paper proposes a 3D video extrapolation method that disentangles geometry and motion, improving accuracy and quality of future video predictions.

Accelerating Image Super-Resolution Networks with Pixel-Level Classification

http://arxiv.org/abs/2407.21448v1

Compressor summary: The proposed method (PCSR) adaptively allocates computational resources to pixels based on their restoration difficulty, improving efficiency and performance for single image super-resolution.

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

http://arxiv.org/abs/2407.21443v1

Compressor summary: SliSum is a novel summary generation strategy that improves the faithfulness of LLMs by dividing the source article into overlapping windows and generating local summaries for each window, then aggregating them using clustering and majority voting.
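
The windowing and voting steps in this summary can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `sliding_windows` and `majority_vote`, the sentence-level granularity, and the parameters `size` and `stride` are hypothetical, and both the LLM call that produces each local summary and the clustering step are left out.

```python
from collections import Counter
from typing import List, Sequence


def sliding_windows(sentences: Sequence[str], size: int, stride: int) -> List[List[str]]:
    """Split sentences into windows of `size`; windows overlap when stride < size."""
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("need 0 < stride <= size")
    if not sentences:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(sentences[start:start + size]))
        # Stop once the current window reaches the end of the article.
        if start + size >= len(sentences):
            break
        start += stride
    return windows


def majority_vote(statements: Sequence[str]) -> str:
    """Return the most frequent statement among candidates for the same content."""
    if not statements:
        raise ValueError("no statements to vote on")
    return Counter(statements).most_common(1)[0][0]


if __name__ == "__main__":
    article = ["s1", "s2", "s3", "s4", "s5"]
    print(sliding_windows(article, size=3, stride=2))
    print(majority_vote(["A", "B", "A"]))
```

Because the windows overlap, each part of the article is summarized more than once, which gives the voting step several candidates to compare.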

QuestGen: Effectiveness of Question Generation Methods for Fact-Checking Applications

http://arxiv.org/abs/2407.21441v1

Compressor summary: The paper shows that automated question generation can improve fact-checking efficiency and sometimes yield better evidence than human-written questions.

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

http://arxiv.org/abs/2407.21439v1

Compressor summary: RagLLaVA is a novel framework that uses knowledge-enhanced reranking and noise-injected training to improve multimodal retrieval-augmented generation for dynamic contexts, addressing the multi-granularity noisy correspondence problem.

A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

http://arxiv.org/abs/2407.21438v1

Compressor summary: Key points:
- Human-object interactions (HOI) detection is important for visual reasoning and scene understanding
- Existing methods struggle with rare human-object pairs due to real world bias
- CEFA is a novel framework that aligns generated data with original data at the feature level and bridges the domain gap
- CEFA consists of a feature alignment module and a context enhancement module
Summary: CEFA is a new framework for improving HOI detection on rare categories by aligning features and enhancing context of generated and original data.

Enriching thermal point clouds of buildings using semantic 3D building models

http://arxiv.org/abs/2407.21436v1

Compressor summary: The proposed method enhances thermal point clouds with building semantics and location, improving thermal analysis and supporting deep learning models.

Analyzing the impact of semantic LoD3 building models on image-based vehicle localization

http://arxiv.org/abs/2407.21432v1

Compressor summary: The paper proposes a novel car localization method using image features and detailed 3D building models, improving accuracy in GNSS-denied urban areas.

Cost-Effective Hallucination Detection for LLMs

http://arxiv.org/abs/2407.21424v1

Compressor summary: The text describes a pipeline for detecting hallucinations in large language models' outputs by scoring, calibrating, and thresholding their confidence, and proposes a multi-scoring framework to improve performance and reduce costs.

Generalized Tampered Scene Text Detection in the era of Generative AI

http://arxiv.org/abs/2407.21422v1

Compressor summary: The paper proposes a new task, dataset, and framework for detecting text tampering in images by generative AI models, aiming to improve generalization and perception of forgery detection.

FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

http://arxiv.org/abs/2407.21418v1

Compressor summary: FTuner is a new technique for deep learning compilers that uses uKernels to patch together small tensors, achieving comparable performance and speedup while reducing tuning time significantly.

Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

http://arxiv.org/abs/2407.21417v1

Compressor summary: ReSet is a method to improve language models by combining self-instruction and rejection sampling, which enhances both faithfulness and instruction following compared to multi-task learning.

VIPeR: Visual Incremental Place Recognition with Adaptive Mining and Lifelong Learning

http://arxiv.org/abs/2407.21416v1

Compressor summary: VIPeR is a novel visual incremental place recognition method that adapts to new environments while preserving previous ones, using an adaptive mining strategy and a memory bank with probabilistic knowledge distillation.

Benchmarking AIGC Video Quality Assessment: A Dataset and Unified Model

http://arxiv.org/abs/2407.21408v1

Compressor summary: Key points:
- Paper investigates subjective and objective quality assessment of AI-generated video (AIGC)
- Constructs LGVQ dataset with 2,808 AIGC videos from 6 models and 468 text prompts
- Evaluates existing metrics on LGVQ dataset and proposes UGVQ model to assess quality across three aspects
Summary: The paper presents a large dataset and a new metric for evaluating the quality of AI-generated video from different perspectives, including spatial, temporal, and text-to-video alignment.

DD-rPPGNet: De-interfering and Descriptive Feature Learning for Unsupervised rPPG Estimation

http://arxiv.org/abs/2407.21402v1

Compressor summary: The paper presents a novel network, DD-rPPGNet, that eliminates interference in remote photoplethysmography (rPPG) signals to improve estimation performance.

SmileyNet -- Towards the Prediction of the Lottery by Reading Tea Leaves with AI

http://arxiv.org/abs/2407.21385v1

Compressor summary: SmileyNet is a neural network that uses smiley faces and positive reinforcement to predict coin flips using tea leaf images, outperforming other models and enabling lottery wins.

GEGA: Graph Convolutional Networks and Evidence Retrieval Guided Attention for Enhanced Document-level Relation Extraction

http://arxiv.org/abs/2407.21384v1

Compressor summary: GEGA is a novel graph neural network-based model for extracting relations between entities from unstructured document text, addressing challenges in evidence retrieval and complex cross-relations.

An Extended Kalman Filter Integrated Latent Feature Model on Dynamic Weighted Directed Graphs

http://arxiv.org/abs/2407.21376v1

Compressor summary: The study proposes a novel EKLF model that uses an Extended Kalman Filter to track complex temporal patterns and an ALS algorithm to train latent features for representing dynamic weighted directed graphs, achieving better prediction accuracy and efficiency than existing models.

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

http://arxiv.org/abs/2407.21368v1

Compressor summary: Key points:
- Large Vision-Language Models (LVLMs) are successful in medical VQA but suffer from hallucination and imbalanced data problems
- Two prompting strategies reduce hallucination and improve VQA performance on complex pathologies
- The methods can be extended to general LVLM domains
Summary: The authors propose two prompting strategies that enhance LVLMs' ability to diagnose complex medical pathologies by reducing hallucination and leveraging weak learners.

ESIQA: Perceptual Quality Assessment of Vision-Pro-based Egocentric Spatial Images

http://arxiv.org/abs/2407.21363v1

Compressor summary: This paper introduces ESIQAD, the first quality assessment database for egocentric spatial images in eXtended Reality, and evaluates 22 IQA models using it.

ProSpec RL: Plan Ahead, then Execute

http://arxiv.org/abs/2407.21359v1

Compressor summary: ProSpec is a Reinforcement Learning method that uses prospective thinking to make optimal, lower-risk decisions by imagining future trajectories and employing cycle consistency for state reversibility and data efficiency.

Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs

http://arxiv.org/abs/2407.21358v1

Compressor summary: Tree-of-Traversals is a new algorithm that helps large language models use knowledge graphs for better reasoning and question answering.

Differentially Private Block-wise Gradient Shuffle for Deep Learning

http://arxiv.org/abs/2407.21347v1

Compressor summary: DP-BloGS is a new privacy-preserving algorithm for deep learning that shuffles gradients probabilistically and achieves fast training with strong protection against data extraction.

Dual-Constrained Dynamical Neural ODEs for Ambiguity-aware Continuous Emotion Prediction

http://arxiv.org/abs/2407.21344v1

Compressor summary: The paper proposes a new method to model how emotions change over time using neural networks and constraints, and shows good results on a speech emotion database.

High-throughput 3D shape completion of potato tubers on a harvester

http://arxiv.org/abs/2407.21341v1

Compressor summary: The authors developed a 3D shape completion network (CoRe++) that uses RGB-D camera data to complete the 3D shape of individual tubers, enabling more accurate potato yield estimates, and showed that it runs quickly and accurately on an operational harvester.

Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks

http://arxiv.org/abs/2407.21338v1

Compressor summary: The text introduces a new reinforcement learning method called NaSA-TD3 that uses intrinsic motivation to improve exploration and achieve better performance in complex, sparse environments with image inputs.

Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM

http://arxiv.org/abs/2407.21333v1

Compressor summary: Chat2Layout is an interactive system that uses large language models to generate and arrange furniture layouts in response to user input, enabling seamless communication and feedback-driven refinement.

CAMAv2: A Vision-Centric Approach for Static Map Element Annotation

http://arxiv.org/abs/2407.21331v1

Compressor summary: CAMAv2 is a vision-centric approach that generates high-quality, consistent, and accurate 3D annotations of static map elements without LiDAR inputs, improving the performance of models trained with these annotations.

Performance of Recent Large Language Models for a Low-Resourced Language

http://arxiv.org/abs/2407.21330v1

Compressor summary: Recent large language models, especially Claude and GPT-4o, perform better on Sinhala than previous versions and other models, while Llama and Mistral can be improved by fine-tuning.

MetaOpenFOAM: an LLM-based multi-agent framework for CFD

http://arxiv.org/abs/2407.21320v1

Compressor summary: MetaOpenFOAM is a novel framework that automates complex CFD simulations using natural language input and LLMs, achieving high pass rates and low costs per task.

Big Cooperative Learning

http://arxiv.org/abs/2407.21319v1

Compressor summary: The text discusses how cooperation in training foundation models leads to advancements in artificial intelligence and proposes a new model, BigLearn-GAN, as an example.

Pathology Foundation Models

http://arxiv.org/abs/2407.21317v1

Compressor summary: Pathology AI, especially Foundation Models, has greatly improved diagnosis and decision-making, but faces challenges for clinical application.

State-observation augmented diffusion model for nonlinear assimilation

http://arxiv.org/abs/2407.21314v1

Compressor summary: The text introduces a new data-driven algorithm (SOAD) for data assimilation with nonlinear physical and observational models, which improves accuracy over traditional methods.

EUDA: An Efficient Unsupervised Domain Adaptation via Self-Supervised Vision Transformer

http://arxiv.org/abs/2407.21311v1

Compressor summary: The paper introduces EUDA, an efficient domain adaptation framework that uses DINOv2 as a feature extractor and SDAL to balance adaptation and alignment, achieving comparable results with significantly fewer parameters.

Enhanced Self-Checkout System for Retail Based on Improved YOLOv10

http://arxiv.org/abs/2407.21308v1

Compressor summary: The paper introduces an improved self-checkout system using YOLOv10 for retail automation, with better product recognition and faster checkout speed than existing methods.

A Vectorization Method Induced By Maximal Margin Classification For Persistent Diagrams

http://arxiv.org/abs/2407.21298v1

Compressor summary: The authors propose a new geometric vectorization method for persistent diagrams, which improves protein function prediction by using topological data analysis.

TrackSorter: A Transformer-based sorting algorithm for track finding in High Energy Physics

http://arxiv.org/abs/2407.21290v1

Compressor summary: The paper introduces TrackSorter, a novel algorithm based on Transformers, that converts particle data into discrete tokens and sorts them to find tracks in High Energy Physics.

Multi-Level Querying using A Knowledge Pyramid

http://arxiv.org/abs/2407.21276v1

Compressor summary: The paper proposes a multi-layer knowledge pyramid approach for Retrieval-Augmented Generation that improves precision and recall, and shows better results than existing methods on two benchmarks.

FreqTSF: Time Series Forecasting Via Simulating Frequency Kramer-Kronig Relations

http://arxiv.org/abs/2407.21275v1

Compressor summary: The paper proposes a new method for long-term time series forecasting using frequency domain representations and a novel attention mechanism based on Kramers-Kronig relations, achieving significant performance improvements over existing methods.

Automated Quantification of Hyperreflective Foci in SD-OCT With Diabetic Retinopathy

http://arxiv.org/abs/2407.21272v1

Compressor summary: The authors propose an automated algorithm to measure hyperreflective foci in retinal images, which could help diagnose and monitor various retinal diseases.

Model Attribution in Machine-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning

http://arxiv.org/abs/2407.21264v1

Compressor summary: The paper proposes a novel Supervised Contrastive Learning method to improve model attribution for machine-generated disinformation by focusing on the differences between large language models.

Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

http://arxiv.org/abs/2407.21260v1

Compressor summary: The paper analyzes how distributional reinforcement learning can improve performance in stochastic environments by using Bellman unbiasedness and moment functionals, and proposes an efficient algorithm with a regret bound.

Lifelong Person Search

http://arxiv.org/abs/2407.21252v1

Compressor summary: The paper introduces lifelong person search, a problem where models are incrementally trained on new datasets while preserving old dataset knowledge, using techniques like knowledge distillation and rehearsal-based instance matching.