arxiv compressed, 2024-07-09

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-09, generated by the compressor, my personal LLM-based project.


Multi-Object Hallucination in Vision-Language Models

http://arxiv.org/abs/2407.06192v1

Compressor summary: This work studies how large vision language models hallucinate multiple objects in images, introduces a new evaluation protocol called ROPE, and identifies factors affecting these behaviors.


Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images

http://arxiv.org/abs/2407.06191v1

Compressor summary: Tailor3D is a novel pipeline that creates customized 3D objects from dual-side images by emulating a tailor's ability to edit and stitch together the front and back of garments.


4D Contrastive Superflows are Dense 3D Representation Learners

http://arxiv.org/abs/2407.06190v1

Compressor summary: SuperFlow is a novel framework that uses consecutive LiDAR-camera pairs to learn spatiotemporal features for accurate 3D perception in autonomous driving, reducing human annotations and improving performance across 11 heterogeneous datasets.


Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

http://arxiv.org/abs/2407.06189v1

Compressor summary: Video-STaR is a self-training approach that leverages existing labeled video datasets to improve large vision-language models' performance on various tasks.


CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation

http://arxiv.org/abs/2407.06188v1

Compressor summary: CrowdMoGen is a text-driven framework that uses a large language model to generate realistic and flexible crowd motions in various scenarios without paired training data.


JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

http://arxiv.org/abs/2407.06187v1

Compressor summary: JeDi is a finetuning-free text-to-image model that uses reference images to generate personalized images with high quality and ease.


Stepping on the Edge: Curvature Aware Learning Rate Tuners

http://arxiv.org/abs/2407.06183v1

Compressor summary: The text discusses how curvature information affects learning rate tuning in deep learning, introduces a new method (CDAT) that stabilizes curvature better than classical methods, and shows CDAT's benefits in full batch and mini-batch regimes.


Transfer Learning with Self-Supervised Vision Transformers for Snake Identification

http://arxiv.org/abs/2407.06178v1

Compressor summary: The authors propose a method to identify snake species from images using Meta's DINOv2 vision transformer model and achieve promising results.


Vision-Language Models under Cultural and Inclusive Considerations

http://arxiv.org/abs/2407.06177v1

Compressor summary: The authors create a culture-centric benchmark to evaluate vision-language models' captioning abilities for visually impaired people in diverse settings, identifying challenges like hallucination and misalignment of automatic metrics with human judgment.


The Tug-of-War Between Deepfake Generation and Detection

http://arxiv.org/abs/2407.06174v1

Compressor summary: This paper surveys deepfake generation and detection techniques, stressing the importance of effective countermeasures and data diversity to combat misinformation and fraud.


Scaling Exponents Across Parameterizations and Optimizers

http://arxiv.org/abs/2407.05872v1

Compressor summary: The paper explores different parameterizations and optimizers for scaling models and proposes a new Adam variant called Adam-atan2 that avoids gradient underflow.


PORCA: Root Cause Analysis with Partially Observed Data

http://arxiv.org/abs/2407.05869v1

Compressor summary: PORCA is a new RCA framework that can handle missing data and system heterogeneity, improving reliability in identifying root causes of faults.


KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions

http://arxiv.org/abs/2407.05868v1

Compressor summary: The authors propose an automated method to create false premise questions based on knowledge graphs, and introduce a large-scale benchmark for evaluating language models' vulnerability to factuality hallucination.


Neural Network-based Information Set Weighting for Playing Reconnaissance Blind Chess

http://arxiv.org/abs/2407.05864v1

Compressor summary: The text discusses using neural networks to assign weights to states in an information set for better gameplay in Reconnaissance Blind Chess, and shows that a Siamese network outperforms a convolutional one.
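
A minimal sketch of the general idea in PyTorch: a shared encoder embeds the player's observation and each candidate true board, and a similarity score becomes the candidate's weight in the information set. The 13-plane 8x8 board encoding, layer sizes, and cosine-similarity head are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (PyTorch), assuming a 13-plane 8x8 board encoding and a
# cosine-similarity head; not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseWeighter(nn.Module):
    def __init__(self, planes=13, dim=128):
        super().__init__()
        # Shared encoder applied to both the observation and each candidate board.
        self.encoder = nn.Sequential(
            nn.Conv2d(planes, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, dim),
        )

    def forward(self, observation, candidate):
        a = self.encoder(observation)
        b = self.encoder(candidate)
        # Similarity in embedding space, squashed to (0, 1), serves as the weight.
        return torch.sigmoid(F.cosine_similarity(a, b, dim=-1))

# Weighting an information set of 5 candidate true boards against one observation.
obs = torch.randn(1, 13, 8, 8)
candidates = torch.randn(5, 13, 8, 8)
weights = SiameseWeighter()(obs.expand(5, -1, -1, -1), candidates)
weights = weights / weights.sum()  # normalized weights over the information set
```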


Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

http://arxiv.org/abs/2407.05862v1

Compressor summary: The paper proposes Point-CMAE, a method that combines masked autoencoder and contrastive learning for 3D point cloud pretraining with ViTs, improving representation quality and transfer performance.


Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

http://arxiv.org/abs/2407.05858v1

Compressor summary: mllm-NPU is a novel system that efficiently leverages on-device neural processing units to speed up large language model prefilling and save energy on mobile devices.


Wavelet Convolutions for Large Receptive Fields

http://arxiv.org/abs/2407.05848v1

Compressor summary: The authors propose a Wavelet Transform Convolution (WTConv) layer for CNNs that enables larger receptive fields with fewer parameters, improved robustness, and better shape recognition.


Anatomy-guided Pathology Segmentation

http://arxiv.org/abs/2407.05844v1

Compressor summary: The paper proposes a generalist segmentation model that combines anatomy and pathology information for improved medical image analysis, using a novel query-based transformer approach.


Evaluating the Fairness of Neural Collapse in Medical Image Classification

http://arxiv.org/abs/2407.05843v1

Compressor summary: The study explores the effect of Neural Collapse on bias in medical imaging, finding that it can cause a significant drop in performance when trained with biased data.


3D Vessel Graph Generation Using Denoising Diffusion

http://arxiv.org/abs/2407.05842v1

Compressor summary: The authors propose a new method to generate realistic 3D blood vessel networks using denoising diffusion models, which can handle complex structures like capillaries and the Circle of Willis.


An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

http://arxiv.org/abs/2407.05841v1

Compressor summary: The paper proposes Constrained Word2Vec (CW2V), a simple and effective method for initializing the embeddings of new tokens when expanding language models' vocabularies to new languages, without requiring cross-lingual embeddings.
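
For context, a common baseline for this problem is to initialize each new token's embedding from the pieces the old tokenizer splits it into; a minimal sketch, where the averaging heuristic is an assumption for illustration (CW2V itself is a different, constrained scheme described in the paper):

```python
# Minimal sketch of a mean-of-subword-pieces initialization for new tokens;
# the averaging heuristic is an assumption for illustration, not CW2V itself.
import numpy as np

def init_new_embeddings(old_embeddings, tokenize_with_old_vocab, new_tokens):
    """old_embeddings: (V, d) array; tokenize_with_old_vocab: str -> list of old ids."""
    d = old_embeddings.shape[1]
    new_rows = np.empty((len(new_tokens), d), dtype=old_embeddings.dtype)
    fallback = old_embeddings.mean(axis=0)  # used if a token has no valid pieces
    for i, token in enumerate(new_tokens):
        piece_ids = tokenize_with_old_vocab(token)
        new_rows[i] = old_embeddings[piece_ids].mean(axis=0) if piece_ids else fallback
    return np.vstack([old_embeddings, new_rows])
```

Averaging keeps each new row inside the convex hull of the existing embeddings, which is typically a safer starting point than random initialization.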


Graph Reasoning Networks

http://arxiv.org/abs/2407.05816v1

Compressor summary: Graph Reasoning Networks (GRNs) combine fixed and learned graph representations with a differentiable solver to improve reasoning in graph-based machine learning.


Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

http://arxiv.org/abs/2407.05814v1

Compressor summary: The paper proposes a cross-domain few-shot in-context learning method using multimodal large language models to improve traffic sign recognition for autonomous driving.


MapsTP: HD Map Images Based Multimodal Trajectory Prediction for Automated Vehicles

http://arxiv.org/abs/2407.05811v1

Compressor summary: The text describes a multimodal approach to predict ego vehicle trajectories using ResNet-50, IMU sensor data, and high-definition maps, which improves accuracy and reliability in urban environments.


Integrating AI in College Education: Positive yet Mixed Experiences with ChatGPT

http://arxiv.org/abs/2407.05810v1

Compressor summary: The study examines how using AI chatbots in a medical imaging course affects students' engagement, perception, and learning outcomes, highlighting both benefits and concerns.


HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels

http://arxiv.org/abs/2407.05795v1

Compressor summary: The paper proposes HyCIR, which uses synthetic labels to improve zero-shot composed image retrieval (ZS-CIR), and introduces SynCir, a pipeline that generates image-text pairs based on visual and semantic similarity.


A Primal-Dual Online Learning Approach for Dynamic Pricing of Sequentially Displayed Complementary Items under Sale Constraints

http://arxiv.org/abs/2407.05793v1

Compressor summary: The paper proposes an online optimization algorithm for dynamically pricing complementary items with sequential display, sales constraint, and uncertainty, using a Markov Decision Process framework and online learning tools.


CANDID DAC: Leveraging Coupled Action Dimensions with Importance Differences in DAC

http://arxiv.org/abs/2407.05789v1

Compressor summary: The paper introduces a new DAC benchmark (CANDID) and shows that sequential policies perform better than independent learning in managing interdependent action dimensions with varying importance.


Automated Computational Energy Minimization of ML Algorithms using Constrained Bayesian Optimization

http://arxiv.org/abs/2407.05788v1

Compressor summary: The paper applies Constrained Bayesian Optimization (CBO) to minimize the computational energy of machine learning algorithms while maintaining model performance.
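
One standard way to encode such a constraint in Bayesian optimization is to weight expected improvement on the objective (energy) by the modeled probability that a performance floor holds; a minimal sketch of that acquisition function, where the GP surrogates and the accuracy threshold are assumptions for illustration, not the paper's setup:

```python
# Minimal sketch of a constrained-EI acquisition: expected improvement on
# energy, weighted by the probability that accuracy stays above a floor.
# The GP models and ACC_FLOOR are illustrative assumptions, not the paper's setup.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

energy_gp = GaussianProcessRegressor()  # fit on (configs, measured energy)
acc_gp = GaussianProcessRegressor()     # fit on (configs, measured accuracy)
ACC_FLOOR = 0.90                        # hypothetical performance constraint

def constrained_ei(x, best_feasible_energy):
    x = np.atleast_2d(x)
    mu_e, sd_e = energy_gp.predict(x, return_std=True)
    mu_a, sd_a = acc_gp.predict(x, return_std=True)
    # Expected improvement for *minimizing* energy.
    z = (best_feasible_energy - mu_e) / np.maximum(sd_e, 1e-9)
    ei = (best_feasible_energy - mu_e) * norm.cdf(z) + sd_e * norm.pdf(z)
    # Probability that the accuracy constraint is satisfied at x.
    p_feasible = 1.0 - norm.cdf((ACC_FLOOR - mu_a) / np.maximum(sd_a, 1e-9))
    return ei * p_feasible  # maximize this to pick the next configuration
```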


Large Language Models for Judicial Entity Extraction: A Comparative Study

http://arxiv.org/abs/2407.05786v1

Compressor summary: This paper investigates how Large Language Models perform in identifying domain-specific entities within case law documents and finds that Mistral and Gemma are the most effective models for this task.


Regret Analysis of Multi-task Representation Learning for Linear-Quadratic Adaptive Control

http://arxiv.org/abs/2407.05781v1

Compressor summary: The text studies the benefits and challenges of using shared features for multiple robotic agents in dynamic environments with changing goals, and shows that representation learning can reduce regret in control tasks compared to single-task learning.


When is the consistent prediction likely to be a correct prediction?

http://arxiv.org/abs/2407.05778v1

Compressor summary: The paper challenges the self-consistency principle in large language models and suggests that consistent answers derived from longer reasoning texts are more likely to be correct due to autonomous chain-of-thought reasoning.


Structural Generalization in Autonomous Cyber Incident Response with Message-Passing Neural Networks and Reinforcement Learning

http://arxiv.org/abs/2407.05775v1

Compressor summary: The paper presents machine learning-based automated incident response agents that handle changes in network structure by using relational agent learning, with a message-passing neural network encoding the network state as a graph; evaluation on a cyber incident simulator shows advantages over the default vector representation in some cases.


Multi-times Monte Carlo Rendering for Inter-reflection Reconstruction

http://arxiv.org/abs/2407.05771v1

Compressor summary: Ref-MC2 is a new inverse rendering method that uses multi-time Monte Carlo sampling to accurately reconstruct reflective 3D objects with environmental illumination and inter-reflections, while reducing computational complexity and improving geometry accuracy.


Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework

http://arxiv.org/abs/2407.05769v1

Compressor summary: The proposed 3D object detection framework uses a Semantic-aware Multi-branch Sampling module and multi-view consistency constraints to improve performance, especially for low-performance backbones.


Enlarging Feature Support Overlap for Domain Generalization

http://arxiv.org/abs/2407.05765v1

Compressor summary: Because ERM and even invariant risk minimization (IRM) can fail when pseudo-invariant features have insufficient support overlap across domains, the paper combines IRM with Bayesian random semantic data augmentation to increase sample diversity, enlarge feature support overlap, and improve out-of-distribution (OOD) generalization.


Large Language Models Understand Layouts

http://arxiv.org/abs/2407.05750v1

Compressor summary: The paper shows that large language models can process text layouts and answer questions requiring spatial reasoning, and this ability comes from pretraining data and instruction tuning.


Do Multilingual Large Language Models Mitigate Stereotype Bias?

http://arxiv.org/abs/2407.05740v1

Compressor summary: Multilingual large language models (LLMs) reduce bias and improve prediction accuracy compared to monolingual LLMs.


TransMA: an explainable multi-modal deep learning model for predicting properties of ionizable lipid nanoparticles in mRNA delivery

http://arxiv.org/abs/2407.05736v1

Compressor summary: The TransMA model predicts the efficiency of ionizable lipid nanoparticles (LNPs) for delivering mRNA using a multi-modal molecular structure fusion architecture that captures both 3D spatial and 1D sequential features, potentially speeding up the LNP design process.


Empirical Study of Symmetrical Reasoning in Conversational Chatbots

http://arxiv.org/abs/2407.05734v1

Compressor summary: The text explores how well chatbots can understand predicate symmetry, a human ability, using large language models and in-context learning, and compares their performance to humans.


Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

http://arxiv.org/abs/2407.05733v1

Compressor summary: The study suggests combining large language models and comparative judgment for automated essay scoring, outperforming traditional methods.


FairPFN: Transformers Can do Counterfactual Fairness

http://arxiv.org/abs/2407.05732v1

Compressor summary: The authors propose FairPFN, a transformer model that learns to eliminate the causal effects of protected attributes on observational data, improving counterfactual fairness in machine learning systems.


Gait Patterns as Biomarkers: A Video-Based Approach for Classifying Scoliosis

http://arxiv.org/abs/2407.05726v1

Compressor summary: The authors propose a video-based, non-invasive gait analysis method for scoliosis classification using a large-scale dataset, Scoliosis1K, and develop an enhanced model, ScoNet-MT, with promising diagnostic accuracy.


PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation

http://arxiv.org/abs/2407.05721v1

Compressor summary: PsycoLLM is a specialized psychological large language model trained on high-quality data and evaluated on psychological counseling exams, showing better performance than other LLMs.


A Factuality and Diversity Reconciled Decoding Method for Knowledge-Grounded Dialogue Generation

http://arxiv.org/abs/2407.05718v1

Compressor summary: The paper introduces DoGe, a novel method for dialogue generation that alternates between internal and external knowledge to balance factuality and diversity without relying on randomness.


Implementing a hybrid approach in a knowledge engineering process to manage technical advice relating to feedback from the operation of complex sensitive equipment

http://arxiv.org/abs/2407.05714v1

Compressor summary: The article describes how an industrial company used knowledge engineering methods to create a system, "SARBANES", that supports its business processes and preserves its expertise in a nuclear and defense context.


Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge

http://arxiv.org/abs/2407.05713v1

Compressor summary: The paper introduces SOIA-DOD, a method that detects active objects and predicts their interactions in egocentric videos, achieving state-of-the-art performance.


MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices

http://arxiv.org/abs/2407.05712v1

Compressor summary: MobilePortrait is a lightweight method for real-time neural head avatar animation on mobile devices using a mixed representation of keypoints and precomputed visual features.


Fast and Continual Knowledge Graph Embedding via Incremental LoRA

http://arxiv.org/abs/2407.05705v1

Compressor summary: The paper proposes a fast framework for continuous learning in knowledge graphs that preserves old knowledge and efficiently learns new knowledge using incremental low-rank adapters.


Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

http://arxiv.org/abs/2407.05704v1

Compressor summary: The paper proposes APO-MVP, an algorithm for learning in adversarial MDPs with an oblivious adversary and a reward function revealed at the end of each episode, achieving a regret bound of $\mathcal{O}(\text{poly}(H)\sqrt{SAT})$ and avoiding occupancy measures.


LGRNet: Local-Global Reciprocal Network for Uterine Fibroid Segmentation in Ultrasound Videos

http://arxiv.org/abs/2407.05703v1

Compressor summary: The paper introduces a new method (LGRNet) for segmenting uterine fibroids in ultrasound videos, which can help detect and treat them early to prevent malignant transformations.


InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

http://arxiv.org/abs/2407.05700v1

Compressor summary: The paper proposes INVERSE-INSTRUCT, a method that improves instruction-tuned code LLMs by generating instructions from code snippets instead of translating natural language to code, resulting in better performance on various benchmarks.


On the Limitations of Compute Thresholds as a Governance Strategy

http://arxiv.org/abs/2407.05694v1

Compressor summary: This essay examines compute thresholds as a governance tool, questions their effectiveness in mitigating risk, and suggests alternative approaches.


Sub-SA: Strengthen In-context Learning via Submodular Selective Annotation

http://arxiv.org/abs/2407.05693v1

Compressor summary: Sub-SA is a selective annotation method that uses a submodular objective with reward-penalty regularization to reduce annotation costs while improving the quality of in-context learning examples.
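
As a rough illustration of the submodular side, greedy selection under a facility-location objective picks, at each step, the example that most increases coverage of the pool; the similarity matrix and budget below are placeholders, and Sub-SA's actual objective adds the reward-penalty term described in the paper.

```python
# Minimal sketch of greedy selection under a facility-location (submodular)
# coverage objective; sim and the budget are placeholders.
import numpy as np

def greedy_facility_location(sim, budget):
    """sim: (n, n) pairwise similarities over the unlabeled pool."""
    selected, covered = [], np.zeros(sim.shape[0])
    for _ in range(budget):
        # Marginal gain of adding each candidate to the selected set.
        gains = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[:, best])
    return selected

# e.g. sim = embeddings @ embeddings.T for L2-normalized sentence embeddings
```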


Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

http://arxiv.org/abs/2407.05690v1

Compressor summary: TransAct is a structured pruning method that reduces the computational overhead of large language models by pruning intra-module activations while preserving inter-module ones, achieving high compression with little loss in efficiency or performance.


Learning with Alignments: Tackling the Inter- and Intra-domain Shifts for Cross-multidomain Facial Expression Recognition

http://arxiv.org/abs/2407.05688v1

Compressor summary: The paper proposes a novel method called LA-CMFER to improve cross-multidomain facial expression recognition by addressing inter- and intra-domain shifts using dual-level and multi-view alignment techniques.


Retrieved In-Context Principles from Previous Mistakes

http://arxiv.org/abs/2407.05682v1

Compressor summary: The paper proposes a new framework, Retrieved In-Context Principles (RICP), which uses the teacher model to generate reasons and insights from student model mistakes to improve in-context learning for reasoning tasks.


Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering

http://arxiv.org/abs/2407.05680v1

Compressor summary: Our method reconstructs high-fidelity hand models with intricate textures using inverse rendering, Graph Convolutional Networks, and mesh-based neural rendering.


BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

http://arxiv.org/abs/2407.05679v1

Compressor summary: BEVWorld is a novel approach that encodes multimodal sensor inputs into a unified latent space for environment modeling in autonomous driving, enabling future scenario prediction and improving downstream tasks.


LLM-Based Open-Domain Integrated Task and Knowledge Assistants with Programmable Policies

http://arxiv.org/abs/2407.05674v1

Compressor summary: KITA is a programmable framework for building task-oriented agents with reliably grounded responses and controllable policies, addressing the unfounded responses of plain LLMs and the brittleness of dialogue trees; in a user study it outperforms a GPT-4 baseline on accuracy, dialogue act, and goal completion rate.


MSTF: Multiscale Transformer for Incomplete Trajectory Prediction

http://arxiv.org/abs/2407.05671v1

Compressor summary: The Multiscale Transformer model predicts missing values in motion forecasting for autonomous driving, improving accuracy and continuity by using multiscale attention and adaptive continuity representation.


Enhancing Neural Radiance Fields with Depth and Normal Completion Priors from Sparse Views

http://arxiv.org/abs/2407.05666v1

Compressor summary: CP_NeRF improves NeRF's view rendering by adding depth and normal dense completion priors, using sparse data to guide ray sampling and construct a normal loss function for better training accuracy.


DMSD-CDFSAR: Distillation from Mixed-Source Domain for Cross-Domain Few-shot Action Recognition

http://arxiv.org/abs/2407.05657v1

Compressor summary: Our novel approach for cross-domain few-shot action recognition uses a ResNet18 backbone and two branches for meta-training, integrating insights from labeled source data and unlabeled target data with domain encoders and dual distillation, improving generalization.


Multi-label Learning with Random Circular Vectors

http://arxiv.org/abs/2407.05656v1

Compressor summary: The paper proposes using random circular vectors for XMC tasks, which improve performance and reduce computational cost.
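
A minimal sketch of the circular-vector idea (the dimensions, random phase codebook, and correlation decoding below are assumptions for illustration): each label is a vector of random phases, a label set is the superposition of the corresponding unit complex vectors, and decoding scores each label by correlation with the encoding.

```python
# Minimal sketch: random phase codebook, superposition encoding, correlation decoding.
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim = 10_000, 512
phases = rng.uniform(-np.pi, np.pi, size=(num_labels, dim))  # one phase vector per label

def encode(label_ids):
    # Superpose the chosen labels' unit complex vectors.
    return np.exp(1j * phases[label_ids]).sum(axis=0)

def decode_scores(encoding):
    # Correlation with each label's vector; labels in the set score near 1.
    return (np.exp(-1j * phases) * encoding).real.mean(axis=1)

enc = encode([3, 42, 777])
top3 = np.argsort(decode_scores(enc))[-3:]
print(sorted(top3.tolist()))  # recovers [3, 42, 777] with high probability
```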


The Dynamic Net Architecture: Learning Robust and Holistic Visual Representations Through Self-Organizing Networks

http://arxiv.org/abs/2407.05650v1

Compressor summary: The Dynamic Net Architecture is a new intelligent-system architecture for vision that uses recurrence-stabilized networks to encode hierarchical feature representations, filter out irrelevant details, and generalize to unseen patterns.


Graph Attention with Random Rewiring

http://arxiv.org/abs/2407.05649v1

Compressor summary: GRASS is a new graph neural network architecture that combines message passing, graph rewiring, and attention mechanisms to enhance information propagation and achieve state-of-the-art performance.


Learning to Adapt Category Consistent Meta-Feature of CLIP for Few-Shot Classification

http://arxiv.org/abs/2407.05647v1

Compressor summary: MF-Adapter improves few-shot image classification by combining low-level and high-level features using Meta-Feature Units that measure local similarity.


OneDiff: A Generalist Model for Image Difference

http://arxiv.org/abs/2407.05645v1

Compressor summary: OneDiff is a novel generalist image difference captioning model that uses a robust vision-language architecture to accurately describe fine-grained variations between images and outperforms existing state-of-the-art models.


Deep Learning-based Anomaly Detection and Log Analysis for Computer Networks

http://arxiv.org/abs/2407.05639v1

Compressor summary: The paper proposes a fusion model combining Isolation Forest, GAN, and Transformer to improve anomaly detection and log analysis in network security tasks.


HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

http://arxiv.org/abs/2407.05638v1

Compressor summary: The novel HPFF model improves deep learning by combining hierarchical locally supervised learning and patch-level feature computation, achieving state-of-the-art performance on various image datasets.


AdaPI: Facilitating DNN Model Adaptivity for Efficient Private Inference in Edge Computing

http://arxiv.org/abs/2407.05633v1

Compressor summary: AdaPI is a novel approach for private inference on encrypted data that adapts to different edge devices' energy budgets and improves test accuracy by 7.3% on CIFAR-100.


New Directions in Text Classification Research: Maximizing The Performance of Sentiment Classification from Limited Data

http://arxiv.org/abs/2407.05627v1

Compressor summary: The paper discusses sentiment analysis with limited data for classifying positive, negative, or neutral opinions on a text dataset about Kaesang Pangarep's appointment, using F1-score as the metric.


Momentum Auxiliary Network for Supervised Local Learning

http://arxiv.org/abs/2407.05623v1

Compressor summary: The paper proposes a Momentum Auxiliary Network (MAN) that improves information transfer between local blocks in deep neural networks, reducing GPU memory usage and increasing accuracy on image classification tasks.
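
The momentum mechanism itself is the familiar exponential-moving-average parameter update; a minimal sketch, with the module names and momentum value as illustrative assumptions rather than the paper's exact scheme:

```python
# Minimal sketch of an exponential-moving-average (momentum) parameter update;
# module names and the momentum value are illustrative assumptions.
import torch

@torch.no_grad()
def momentum_update(aux_module, online_module, m=0.999):
    # aux <- m * aux + (1 - m) * online, parameter by parameter.
    for p_aux, p_online in zip(aux_module.parameters(), online_module.parameters()):
        p_aux.mul_(m).add_(p_online, alpha=1.0 - m)

aux, online = torch.nn.Linear(16, 16), torch.nn.Linear(16, 16)
momentum_update(aux, online)  # called once per training step in practice
```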


On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

http://arxiv.org/abs/2407.05622v1

Compressor summary: The paper studies how hard it is to learn sparse functions using gradient algorithms, introduces a new type of Statistical Queries called $\mathsf{DLQ}$ to model this process, and shows that the query complexity depends on the loss function used.


Explainable Image Recognition via Enhanced Slot-attention Based Classifier

http://arxiv.org/abs/2407.05616v1

Compressor summary: ESCOUTER is a visually explainable classifier that provides transparent insights into deep learning models' decision-making process by incorporating explanations into the final confidence scores and offering positive or negative explanations for all categories.


OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos

http://arxiv.org/abs/2407.05615v1

Compressor summary: The paper proposes OSN, a framework that learns all plausible 3D scene configurations from monocular RGB videos using an object scale network and a joint optimization module.


GenFollower: Enhancing Car-Following Prediction with Large Language Models

http://arxiv.org/abs/2407.05611v1

Compressor summary: GenFollower is a novel approach that uses large language models to model car-following behaviors more accurately, interpreting factors influencing them, and improving traffic management and autonomous driving systems.


Described Spatial-Temporal Video Detection

http://arxiv.org/abs/2407.05610v1

Compressor summary: This paper introduces a new task called described spatial-temporal video detection (DSTVD) that can handle multiple objects in language descriptions of videos, and presents a new benchmark dataset DVD-ST to evaluate it.


Open-world Multi-label Text Classification with Extremely Weak Supervision

http://arxiv.org/abs/2407.05609v1

Compressor summary: The paper proposes a novel method called X-MLClass for open-world multi-label text classification under extremely weak supervision, which utilizes dominant keyphrases to discover and improve label space coverage and accuracy.


WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

http://arxiv.org/abs/2407.05603v1

Compressor summary: WSI-VQA is a framework that interprets whole slide images for carcinoma diagnosis and prognosis through generative visual question answering, covering tasks such as grading, prediction, and subtyping; it is accompanied by a dataset of 8,672 slide-level question-answer pairs over 977 WSIs, on which the W2T model outperforms existing discriminative models in medical correctness.


GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

http://arxiv.org/abs/2407.05600v1

Compressor summary: GenArtist is a unified image generation and editing system using a multimodal large language model agent to coordinate existing models, plan procedures, and perform various tasks efficiently and reliably.


Generative Debunking of Climate Misinformation

http://arxiv.org/abs/2407.05599v1

Compressor summary: The study develops large language models that can automatically detect and correct climate misinformation using a "truth sandwich" structure and various prompting strategies.


GeoNLF: Geometry guided Pose-Free Neural LiDAR Fields

http://arxiv.org/abs/2407.05597v1

Compressor summary: GeoNLF is a hybrid method that combines neural reconstruction and geometric pose optimization for LiDAR point cloud synthesis, improving performance on sparse-view inputs.


SLIM: Spuriousness Mitigation with Minimal Human Annotations

http://arxiv.org/abs/2407.05594v1

Compressor summary: SLIM reduces spurious correlations in deep learning with minimal human input by using attention labeling and feature balancing, improving model reliability and efficiency.


Unmasking Trees for Tabular Data

http://arxiv.org/abs/2407.05593v1

Compressor summary: UnmaskingTrees is a tool for generating tabular data and imputing missing values using trees that incrementally unmask (reveal) features.
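
As a rough sketch of per-feature, tree-based unmasking for imputation (UnmaskingTrees' actual unmasking schedule and models may differ), one can fit a small tree per feature on rows where it is observed and predict it where it is missing:

```python
# Minimal sketch of per-feature tree-based imputation in a fixed feature order;
# UnmaskingTrees' actual unmasking schedule and models may differ.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def impute(X):
    X = X.copy()
    for j in range(X.shape[1]):                 # one "unmasking" step per feature
        missing = np.isnan(X[:, j])
        observed = ~missing
        if not missing.any() or not observed.any():
            continue
        context = np.nan_to_num(np.delete(X, j, axis=1))  # other features as crude input
        model = DecisionTreeRegressor(max_depth=4).fit(context[observed], X[observed, j])
        X[missing, j] = model.predict(context[missing])
    return X
```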


An Experimental Comparison of Transfer Learning against Self-supervised Learning

http://arxiv.org/abs/2407.05592v1

Compressor summary: The paper compares transfer learning and self-supervised learning in the medical field, evaluating their performance and robustness on small datasets with common medical issues.


On the Power of Convolution Augmented Transformer

http://arxiv.org/abs/2407.05591v1

Compressor summary: CAT is a new language model architecture that combines convolutional filters with attention, improving recall, copying, length generalization, and summarization tasks.


Dynamic Neural Radiance Field From Defocused Monocular Video

http://arxiv.org/abs/2407.05586v1

Compressor summary: D2RF is a method to restore sharp novel views from defocused monocular videos by modeling and removing depth-induced blur using layered Depth-of-Field volume rendering.


$\mathrm{E^{2}CFD}$: Towards Effective and Efficient Cost Function Design for Safe Reinforcement Learning via Large Language Model

http://arxiv.org/abs/2407.05580v1

Compressor summary: E^2CFD is a framework that uses a large language model to generate cost functions for safe reinforcement learning, improving policy performance in various safety scenarios.


FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

http://arxiv.org/abs/2407.05578v1

Compressor summary: FALIP improves CLIP's zero-shot performance in various tasks by adjusting its attention without altering the original image information.
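
The underlying trick, adding a bias to the attention logits so tokens inside a region of interest receive more weight without altering the pixels, can be sketched in a few lines; the bias magnitude and how the region is chosen are assumptions for illustration, not FALIP's exact formulation.

```python
# Minimal sketch of an additive attention bias toward a region of interest;
# the bias value and ROI selection are illustrative assumptions.
import torch

def biased_attention(q, k, v, roi_mask, bias=2.0):
    """q, k, v: (tokens, dim); roi_mask: (tokens,) bool marking the prompted region."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    scores = scores + bias * roi_mask.float()   # boost attention *to* keys inside the ROI
    return torch.softmax(scores, dim=-1) @ v

tokens, dim = 50, 64
q = k = v = torch.randn(tokens, dim)
roi = torch.zeros(tokens, dtype=torch.bool)
roi[10:20] = True                               # hypothetical region of interest
out = biased_attention(q, k, v, roi)
```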


ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation

http://arxiv.org/abs/2407.05576v1

Compressor summary: ORMNet is a novel end-to-end model for egocentric hand-object segmentation that leverages hand-guided attention and object relation decoupling to improve accuracy and reduce ambiguity.


Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN

http://arxiv.org/abs/2407.05577v1

Compressor summary: The paper proposes a method to edit talking face images with different emotions using audio-to-landmark and landmark-based editing modules, resulting in high-quality videos.


Towards Reflected Object Detection: A Benchmark

http://arxiv.org/abs/2407.05575v1

Compressor summary: The paper introduces a new benchmark dataset for reflective object detection with diverse images and annotations, revealing the limitations of existing methods in this area.


Spatio-Temporal Encoding and Decoding-Based Method for Future Human Activity Skeleton Synthesis

http://arxiv.org/abs/2407.05573v1

Compressor summary: The paper proposes a new method to predict human activities based on observed skeleton data using spatio-temporal encoding and decoding, outperforming some existing methods.


GMC: A General Framework of Multi-stage Context Learning and Utilization for Visual Detection Tasks

http://arxiv.org/abs/2407.05566v1

Compressor summary: The GMC framework is a general approach for multistage context learning and utilization in visual detection tasks that enhances performance and adaptability with user-defined configurations and diverse network architectures.


LLMBox: A Comprehensive Library for Large Language Models

http://arxiv.org/abs/2407.05563v1

Compressor summary: LLMBox is a library that simplifies the development, use, and evaluation of large language models by providing a unified data interface, comprehensive evaluation, and user-friendly efficiency.


Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

http://arxiv.org/abs/2407.05562v1

Compressor summary: The paper proposes a novel method to improve scene text recognition by enriching character features and addressing large intra-class variance and small inter-class variance issues.


$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

http://arxiv.org/abs/2407.05557v1

Compressor summary: $R^2$-Guard is a robust reasoning-enabled guardrail model for large language models that uses knowledge-enhanced logical reasoning to capture intercorrelations among safety categories, achieving better performance and resilience than existing methods.


PANS: Probabilistic Airway Navigation System for Real-time Robust Bronchoscope Localization

http://arxiv.org/abs/2407.05554v1

Compressor summary: PANS is a novel system that uses Monte-Carlo methods, depth-based motion inference, and bronchial semantic analysis to accurately and robustly localize bronchoscopes in real-time for pulmonary interventions.


A Color Image Analysis Tool to Help Users Choose a Makeup Foundation Color

http://arxiv.org/abs/2407.05553v1

Compressor summary: The paper proposes a method to predict skin color with foundation using two images and calibration with a color checker target.


Ada-adapter:Fast Few-shot Style Personlization of Diffusion Model with Pre-trained Image Encoder

http://arxiv.org/abs/2407.05552v1

Compressor summary: Ada-Adapter is a novel framework for few-shot style personalization of diffusion models that enables efficient zero-shot style transfer with limited source images and text prompts, achieving high-quality artistic stylizations.


Read, Watch and Scream! Sound Generation from Text and Video

http://arxiv.org/abs/2407.05551v1

Compressor summary: ReWaS is a novel video-and-text-to-sound generation method that uses video to control the synthesis of audio from text, allowing for flexible and high-quality sound production.


AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling

http://arxiv.org/abs/2407.05546v1

Compressor summary: The paper introduces Image Content Appeal Assessment (ICAA), a new metric that measures how appealing an image's content is to viewers, and shows its effectiveness in different domains.


LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction

http://arxiv.org/abs/2407.05547v1

Compressor summary: The paper proposes LaSe-E2V, a framework that uses language to guide event-to-video reconstruction, achieving semantic-aware high-quality results by combining event data with text-conditional diffusion models and an Event-guided Spatiotemporal Attention module.


GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

http://arxiv.org/abs/2407.05540v1

Compressor summary: GTP-4o is a novel method that uses a graph-based approach to handle multi-modal information in biomedical domains, including completing missing modalities and aggregating cross-modal data.


On the Equivalence between Logic Programming and SETAF

http://arxiv.org/abs/2407.05538v1

Compressor summary: The paper presents translations between Normal Logic Programs and SETAFs, showing their semantic and structural equivalence, and RFALPs as an expressive subclass of NLPs.


An accurate detection is not all you need to combat label noise in web-noisy datasets

http://arxiv.org/abs/2407.05528v1

Compressor summary: The paper explores unsupervised contrastive learning for detecting out-of-distribution samples in web-crawled data, but finds that it misses some clean examples and proposes a hybrid approach with a small-loss algorithm to improve image classification.


Rethinking Image Skip Connections in StyleGAN2

http://arxiv.org/abs/2407.05527v1

Compressor summary: The paper introduces the image squeeze connection, a new method that improves image synthesis quality and reduces parameters in StyleGAN models, by analyzing and addressing the limitations of the image skip connection technique.


Can Machines Learn the True Probabilities?

http://arxiv.org/abs/2407.05526v1

Compressor summary: The paper examines whether and under what conditions AI machines can learn the true, objective probabilities of the world that optimal decision-making would require.