arxiv compressed, 2024-07-18

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-18 generated by the compressor, my personal LLM-based project.


AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

http://arxiv.org/abs/2407.12784v1

Compressor summary: AgentPoison is a novel backdoor attack that targets long-term memory or RAG knowledge bases of LLM agents, allowing them to be compromised without additional model training or fine-tuning while maintaining normal performance for benign instructions.


SMooDi: Stylized Motion Diffusion Model

http://arxiv.org/abs/2407.12783v1

Compressor summary: SMooDi is a new model that can generate stylized motion for different content and styles, using text guidance and a lightweight adaptor.


Contrastive Adversarial Training for Unsupervised Domain Adaptation

http://arxiv.org/abs/2407.12782v1

Compressor summary: CAT is a novel approach that uses labeled source domain samples to improve feature generation for the target domain in adversarial training, addressing challenges related to robustness, generalization, and alignment.


VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

http://arxiv.org/abs/2407.12781v1

Compressor summary: The text describes a new method that allows controlling camera movement in text-to-video synthesis using transformer-based models and spatiotemporal embeddings.


Generalizable Human Gaussians for Sparse View Synthesis

http://arxiv.org/abs/2407.12777v1

Compressor summary: This paper introduces a new method using Gaussian Splatting to accurately render 3D humans from sparse views, by learning generalizable human Gaussians and leveraging 2D UV space of a template.


OMG-Net: A Deep Learning Framework Deploying Segment Anything to Detect Pan-Cancer Mitotic Figures from Haematoxylin and Eosin-Stained Slides

http://arxiv.org/abs/2407.12773v1

Compressor summary: The study proposes an AI system to automatically detect mitotic figures in cancer images, improving accuracy and consistency in grading and treatment decisions.


LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

http://arxiv.org/abs/2407.12772v1

Compressor summary: The paper introduces a benchmark framework, LMMs-Eval, for evaluating large multimodal models, and proposes two new tools, LMMs-Eval Lite and Multimodal LiveBench, to address the evaluation trilemma of cost, coverage, and contamination.


A survey and taxonomy of methods interpreting random forest models

http://arxiv.org/abs/2407.12759v1

Compressor summary: This paper reviews methods to interpret random forest models and provides a taxonomy of techniques for choosing appropriate tools based on interpretability aspects.


Mutual Information Guided Optimal Transport for Unsupervised Visible-Infrared Person Re-identification

http://arxiv.org/abs/2407.12758v1

Compressor summary: The paper proposes a new unsupervised learning method for cross-modality pedestrian image retrieval, using mutual information, three learning principles, and iterative training with optimal transport assignment and prototype-based contrastive learning.


LookupViT: Compressing visual information to a limited number of tokens

http://arxiv.org/abs/2407.12753v1

Compressor summary: LookupViT is a novel vision transformer block that compresses information from high-resolution tokens to reduce inference cost while maintaining or improving accuracy on various tasks such as image and video classification, and captioning.
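The core idea of compressing many high-resolution tokens into a few "lookup" tokens can be sketched as cross-attention pooling. This is a minimal NumPy illustration of the general mechanism, not the paper's actual block; the shapes and the single-head attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(tokens, queries):
    """Cross-attention pooling: M learned query tokens attend over
    N input tokens, yielding an M-token compressed representation.
    tokens: (N, d), queries: (M, d) -> output: (M, d)."""
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (M, N) attention weights
    return attn @ tokens                             # (M, d) compressed tokens

rng = np.random.default_rng(0)
out = compress_tokens(rng.normal(size=(196, 64)),  # 14x14 patch tokens
                      rng.normal(size=(8, 64)))    # 8 lookup tokens
```

Downstream layers then operate on the 8 compressed tokens instead of all 196, which is where the inference savings come from.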


HDLCopilot: Hardware Design Library Querying with Natural Language

http://arxiv.org/abs/2407.12749v1

Compressor summary: HDLCopilot is a natural language-based system that helps hardware engineers find information in PDKs faster and more accurately, using an LLM to understand complex queries and provide relevant results.


GroundUp: Rapid Sketch-Based 3D City Massing

http://arxiv.org/abs/2407.12739v1

Compressor summary: GroundUp is a tool that helps architects design 3D urban areas by converting their sketches into 3D models and allowing them to revise quickly.


CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

http://arxiv.org/abs/2407.12736v1

Compressor summary: CHOSEN is a co-design framework that automates ViT deployment on FPGAs, improving performance by using multi-kernel design, approximate non-linear functions, efficient logic block usage, and a novel compiler algorithm.


EchoSight: Advancing Visual-Language Models with Wiki Knowledge

http://arxiv.org/abs/2407.12735v1

Compressor summary: EchoSight is a multimodal framework that uses retrieval-augmented generation to help large language models answer visual questions requiring encyclopedic knowledge.


A LLM Benchmark based on the Minecraft Builder Dialog Agent Task

http://arxiv.org/abs/2407.12734v1

Compressor summary: The paper introduces a Minecraft builder task benchmark for evaluating large language models' spatial reasoning and vector math skills.


RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

http://arxiv.org/abs/2407.12730v1

Compressor summary: Uni-Food is a large, unified food dataset for vision-language tasks that includes images, categories, ingredients, recipes, and nutritional information, while RoDE is a novel method to improve LMMs by allocating parameters based on task complexity.


NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model

http://arxiv.org/abs/2407.12727v1

Compressor summary: The paper introduces NL2Contact, a model that generates realistic 3D hand-object contacts from natural language descriptions, and ContactDescribe, a dataset for training the model.


Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?

http://arxiv.org/abs/2407.12725v1

Compressor summary: The text introduces SarcasmCue, a new framework to improve large language models' sarcasm detection by using different prompting strategies that combine sequential and non-sequential methods.


An Evaluation of Continual Learning for Advanced Node Semiconductor Defect Inspection

http://arxiv.org/abs/2407.12724v1

Compressor summary: This paper proposes a meta-learning method for semiconductor defect inspection that adapts to new defect types without forgetting previous knowledge or requiring large datasets.


SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow

http://arxiv.org/abs/2407.12718v1

Compressor summary: SlimFlow is a framework for developing small and efficient one-step diffusion models using rectified flow, addressing challenges like initialization mismatch and distillation issues with Annealing Reflow and Flow-Guided Distillation.


A Unifying Post-Processing Framework for Multi-Objective Learn-to-Defer Problems

http://arxiv.org/abs/2407.12710v1

Compressor summary: The paper proposes a method for developing learn-to-defer systems that work with human experts, optimizing accuracy under various constraints using a generalization of Neyman and Pearson's lemma and showing improved results on COMPAS and ACSIncome datasets.


MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

http://arxiv.org/abs/2407.12709v1

Compressor summary: The paper proposes a method called mixture of multimodal experts (MoME) to improve generalist large language models on vision-language tasks by modulating features and incorporating sparsely gated experts.


IMAGDressing-v1: Customizable Virtual Dressing

http://arxiv.org/abs/2407.12705v1

Compressor summary: The text introduces a new virtual dressing method (IMAGDressing-v1) that allows users to edit and control clothing images in various scenes, using a novel metric (CAMI) and a large dataset (IGPair).


Subgraph-Aware Training of Text-based Methods for Knowledge Graph Completion

http://arxiv.org/abs/2407.12703v1

Compressor summary: The paper proposes a new method for knowledge graph completion that leverages the structural properties of the graphs and improves performance over existing methods.


TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds

http://arxiv.org/abs/2407.12702v1

Compressor summary: TransCAD is a transformer-based model that predicts 3D CAD models from point clouds using a hierarchical learning strategy and a loop refiner, achieving state-of-the-art results with a new metric for CAD sequence evaluation.


Calibrated Diverse Ensemble Entropy Minimization for Robust Test-Time Adaptation in Prostate Cancer Detection

http://arxiv.org/abs/2407.12697v1

Compressor summary: The paper proposes a new method, Diverse Ensemble Entropy Minimization (DEnEM), to improve real-time prostate cancer detection using micro-ultrasound and deep learning, addressing the challenge of data distribution shifts across clinical centers.


4Dynamic: Text-to-4D Generation with Hybrid Priors

http://arxiv.org/abs/2407.12684v1

Compressor summary: The paper presents a text-to-4D generation method that uses a text-to-video diffusion model with reference video supervision to ensure realistic, dynamic motion, combining customized SDS losses and a prior-switching strategy across static 3D and dynamic generation stages, and introducing a dynamic modeling representation for deformation and topology changes.


In-Situ Infrared Camera Monitoring for Defect and Anomaly Detection in Laser Powder Bed Fusion: Calibration, Data Mapping, and Feature Extraction

http://arxiv.org/abs/2407.12682v1

Compressor summary: The authors propose a new approach to accurately map in-situ data for LPBF defect detection using novel IR features and demonstrate its effectiveness through printing, monitoring, and characterizing various parts.


Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

http://arxiv.org/abs/2407.12679v1

Compressor summary: Goldfish is a method for comprehending videos of any length using efficient retrieval and MiniGPT4-Video, achieving significant improvements in both long and short video understanding.


CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems

http://arxiv.org/abs/2407.12676v1

Compressor summary: The paper proposes a method to improve diffusion model-based inverse problem solving by using a pretrained consistency model and enforcing constraints during the sampling process, achieving high reconstruction quality with fewer inference steps.


Enhancing the Utility of Privacy-Preserving Cancer Classification using Synthetic Data

http://arxiv.org/abs/2407.12669v1

Compressor summary: This paper explores how to use privacy-preserving techniques, such as synthetic data generation and differentially private learning, to improve deep learning models for breast cancer detection from mammography images while protecting patient data.


SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization

http://arxiv.org/abs/2407.12667v1

Compressor summary: The paper proposes a scene graph-based NeRF method that handles noisy camera poses using confidence estimation, an IoU loss, and a coarse-to-fine strategy, and introduces a new dataset with outlier poses on which it outperforms existing methods in robustness and quality.


Patch-Level Training for Large Language Models

http://arxiv.org/abs/2407.12665v1

Compressor summary: The paper proposes patch-level training for large language models, which reduces computational costs by compressing multiple tokens into a single patch and predicting the next patch instead of the next token.
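The token-to-patch compression step can be sketched very simply; the paper's exact aggregation may differ, but a minimal version (assuming mean pooling over consecutive tokens, which is one natural choice) looks like:

```python
import numpy as np

def to_patches(token_embeds, patch_size):
    """Average every `patch_size` consecutive token embeddings into one
    patch embedding, shrinking sequence length (and attention cost) by
    that factor. token_embeds: (T, d) with T divisible by patch_size."""
    T, d = token_embeds.shape
    return token_embeds.reshape(T // patch_size, patch_size, d).mean(axis=1)

x = np.arange(24, dtype=float).reshape(6, 4)   # 6 tokens, dim 4
patches = to_patches(x, patch_size=2)           # -> 3 patches, dim 4
```

The model is then trained to predict the next patch rather than the next token, so each forward pass covers `patch_size` times more text.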


InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction

http://arxiv.org/abs/2407.12661v1

Compressor summary: The paper improves 3D scene reconstruction from multi-view images by regularizing SDF-based neural surfaces to encourage mutual information among the surface normals of highly correlated scene points, which are identified using semantic and geometric features.


Fusion Flow-enhanced Graph Pooling Residual Networks for Unmanned Aerial Vehicles Surveillance in Day and Night Dual Visions

http://arxiv.org/abs/2407.12647v1

Compressor summary: The paper proposes a new method, OF-GPRN, that improves UAV detection in dual-vision images using optical fusion and a graph-pooling residual network, achieving a 17.9% higher mAP than ResGCN.


Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

http://arxiv.org/abs/2407.12642v1

Compressor summary: The paper presents a text-guided image synthesis method that uses Large Language Models to generate globally coherent, locally contextualized captions, conditioning on them and on visual features to expand images to arbitrary sizes with superior performance and zero-shot capability.


Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients

http://arxiv.org/abs/2407.12637v1

Compressor summary: The paper proposes an adaptive quantization method for low-bit fixed-point training that minimizes the quantization error for large gradients using an optimal interval and an update algorithm.
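The trade-off the interval controls can be illustrated with a toy symmetric fixed-point quantizer; this is a generic sketch, not the paper's update algorithm, and the `clip` value stands in for the optimal interval the method searches for:

```python
import numpy as np

def quantize_grad(g, clip, bits=4):
    """Symmetric fixed-point quantization of a gradient tensor: values are
    clipped to [-clip, clip] and rounded onto 2**(bits-1) - 1 uniform
    levels per sign. A larger clip reduces clipping error on large
    gradients but coarsens the rounding step for small ones."""
    levels = 2 ** (bits - 1) - 1          # 7 levels per sign for INT4
    scale = clip / levels
    q = np.clip(np.round(g / scale), -levels, levels)
    return q * scale

g = np.array([-0.9, -0.04, 0.0, 0.04, 0.6])
gq = quantize_grad(g, clip=0.7)  # -0.9 is clipped, small grads round to 0
```

Choosing `clip` adaptively per tensor is exactly the knob that determines how much of the error budget goes to large versus small gradients.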


CerberusDet: Unified Multi-Task Object Detection

http://arxiv.org/abs/2407.12632v1

Compressor summary: CerberusDet is a YOLO-based multi-headed framework for efficient object detection across multiple tasks and datasets.


A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality

http://arxiv.org/abs/2407.12629v1

Compressor summary: The paper proves that AdaGrad and Adam, two adaptive gradient methods, have linear convergence when the cost function meets the Polyak-Łojasiewicz inequality, using a simple and unified approach for batch and stochastic gradients.
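For context, the condition involved is standard (stated here in its textbook form, not taken from the paper): a differentiable function f with minimum f* satisfies the Polyak-Łojasiewicz inequality, and under it plus L-smoothness, plain gradient descent with step 1/L already contracts the suboptimality gap geometrically:

```latex
\frac{1}{2}\,\lVert \nabla f(x) \rVert^2 \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\quad \text{for some } \mu > 0 \text{ and all } x,
\qquad\Longrightarrow\qquad
f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)\bigl(f(x_k) - f^{*}\bigr).
```

The paper's contribution is establishing an analogous linear rate for the adaptive step sizes of AdaGrad and Adam.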


Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

http://arxiv.org/abs/2407.12626v1

Compressor summary: This study explores how domain-specific models and uncertainty estimation affect the entropy of a model's output probability distribution in biomedical applications.


Rethinking the Architecture Design for Efficient Generic Event Boundary Detection

http://arxiv.org/abs/2407.12622v1

Compressor summary: This paper improves generic event boundary detection (GEBD) models by simplifying their architecture, reducing redundancy, and enhancing spatiotemporal learning for faster and more accurate results in real-world applications.


Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences

http://arxiv.org/abs/2407.12620v1

Compressor summary: The text explores AI and NLP applications for preserving endangered Indigenous languages through community engagement, machine translation, and interactive language models.


Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

http://arxiv.org/abs/2407.12616v1

Compressor summary: The text proposes a framework that efficiently adapts unimodal models to handle missing data and predict the missing modality's embedding using self-supervised learning.


Strawberry detection and counting based on YOLOv7 pruning and information based tracking algorithm

http://arxiv.org/abs/2407.12614v1

Compressor summary: This study developed a fast and accurate machine vision system for monitoring strawberry growth and yield using pruned deep learning models and an enhanced object tracking algorithm.


On Diversity in Discriminative Neural Networks

http://arxiv.org/abs/2407.12599v1

Compressor summary: The paper presents a neural network architecture that leverages diversity principles to achieve high accuracy in self-supervised and semi-supervised learning tasks.


Estimate Epidemiological Parameters given Partial Observations based on Algebraically Observable PINNs

http://arxiv.org/abs/2407.12598v1

Compressor summary: The study proposes a method called algebraically observable PINNs to estimate epidemiological parameters using noisy and partial trajectory data.


Enhancing Wrist Abnormality Detection with YOLO: Analysis of State-of-the-art Single-stage Detection Models

http://arxiv.org/abs/2407.12597v1

Compressor summary: The study compares different YOLO models for automated wrist fracture detection and finds that they outperform the commonly used two-stage algorithm, Faster R-CNN, especially for pediatric patients.


VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

http://arxiv.org/abs/2407.12594v1

Compressor summary: VisFocus is an OCR-free method for visual document understanding that uses attention mechanisms and pre-training to focus on relevant text patches based on the language prompt.


EvSign: Sign Language Recognition and Translation with Streaming Events

http://arxiv.org/abs/2407.12593v1

Compressor summary: The text describes using event cameras to improve sign language recognition and translation, introducing a new dataset (EvSign) and an efficient transformer-based framework for these tasks.


VegeDiff: Latent Diffusion Model for Geospatial Vegetation Forecasting

http://arxiv.org/abs/2407.12592v1

Compressor summary: VegeDiff is a diffusion model that probabilistically captures uncertainties in geospatial vegetation forecasting, separately modeling the impacts of dynamic meteorological and static environmental variables, and outperforms existing deterministic methods.


Privacy-Preserving Adaptive Re-Identification without Image Transfer

http://arxiv.org/abs/2407.12589v1

Compressor summary: Fed-Protoid is a novel method for privacy-preserving person re-identification that adapts models on edge devices using distributed source prototypes and minimizes a customized MMD loss.
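The MMD loss mentioned here measures the distance between two feature distributions without exchanging raw samples' images. A minimal RBF-kernel version (the standard biased estimator, not the paper's customized variant) can be written as:

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Squared maximum mean discrepancy between samples x: (n, d) and
    y: (m, d) under the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    Zero iff the two mean kernel embeddings coincide."""
    def k(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise dists
        return np.exp(-gamma * sq)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
same = mmd_rbf(rng.normal(size=(50, 3)), rng.normal(size=(50, 3)))
shifted = mmd_rbf(rng.normal(size=(50, 3)), rng.normal(size=(50, 3)) + 2.0)
```

Minimizing such a loss between edge features and source prototypes aligns the distributions while only prototype statistics cross the network.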


Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

http://arxiv.org/abs/2407.12582v1

Compressor summary: The paper proposes a novel method for object detection using event cameras and frame cameras, which improves performance and robustness under challenging conditions.


E5-V: Universal Embeddings with Multimodal Large Language Models

http://arxiv.org/abs/2407.12580v1

Compressor summary: E5-V is a framework that adapts large language models for creating universal multimodal embeddings, improving performance and reducing training costs compared to previous approaches.


The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

http://arxiv.org/abs/2407.12579v1

Compressor summary: The paper presents RFBench, a benchmark for evaluating image generation from realistic-fantasy prompts, and RFNet, a training-free method combining diffusion models with LLMs to generate better images.


DP-KAN: Differentially Private Kolmogorov-Arnold Networks

http://arxiv.org/abs/2407.12569v1

Compressor summary: Kolmogorov-Arnold Network can be trained privately and performs similarly to Multilayer Perceptron in differentially private settings.


LTRL: Boosting Long-tail Recognition via Reflective Learning

http://arxiv.org/abs/2407.12568v1

Compressor summary: Reflective learning is a novel learning paradigm that uses reviewing, summarizing, and correcting processes to improve long-tail recognition performance.


End-to-end Stroke imaging analysis, using reservoir computing-based effective connectivity, and interpretable Artificial intelligence

http://arxiv.org/abs/2407.12553v1

Compressor summary: The paper proposes a pipeline using reservoir computing and directed graph analysis for efficient brain representation in stroke data derived from MRI, enabling classification of effective connectivity and interpretation of disrupted networks with explainable AI tools.


UniTE: A Survey and Unified Pipeline for Pre-training ST Trajectory Embeddings

http://arxiv.org/abs/2407.12550v1

Compressor summary: UniTE is a survey and a unified pipeline for pre-training trajectory embeddings that simplifies their development and analysis.


Abstraction Alignment: Comparing Model and Human Conceptual Relationships

http://arxiv.org/abs/2407.12543v1

Compressor summary: Abstraction alignment is a method to measure how well an ML model's learned representations match human-expected abstractions using a human abstraction graph.


Towards Collaborative Intelligence: Propagating Intentions and Reasoning for Multi-Agent Coordination with Large Language Models

http://arxiv.org/abs/2407.12532v1

Compressor summary: The framework trains large language models as collaborative agents for coordinated behaviors in cooperative MARL by sharing private intentions, adapting comprehension strategies, and dynamically re-planning sub-tasks.


Crafting the Path: Robust Query Rewriting for Information Retrieval

http://arxiv.org/abs/2407.12529v1

Compressor summary: Crafting the Path is a novel structured query rewriting method that improves information retrieval by generating relevant queries using a three-step process and is less dependent on Large Language Models' internal knowledge.


On the Complexity of Identification in Linear Structural Causal Models

http://arxiv.org/abs/2407.12528v1

Compressor summary: The paper proposes a new algorithm for identifying causal parameters in linear structural models from observational data and proves that the identification task is computationally hard in general.


Struct-X: Enhancing Large Language Models Reasoning with Structured Data

http://arxiv.org/abs/2407.12522v1

Compressor summary: Struct-X is a framework that helps large language models use structured data more effectively by encoding it into a topological space and guiding them through five phases to enhance reasoning abilities.


Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition

http://arxiv.org/abs/2407.12519v1

Compressor summary: CLTD is a new method for gait recognition that uses attention and projection to separate identity features from non-identity clues in spatial, temporal, and spectral domains.


Evaluating the transferability potential of deep learning models for climate downscaling

http://arxiv.org/abs/2407.12517v1

Compressor summary: The paper evaluates how well deep learning models can learn from diverse climate data and transfer their knowledge across different tasks, locations, and variables.


On Initializing Transformers with Pre-trained Embeddings

http://arxiv.org/abs/2407.12514v1

Compressor summary: The paper explores why random initialization schemes outperform some pre-trained word and sub-word embeddings in transformer models, and suggests standardizing the embeddings' values as a solution.
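The suggested fix is simple enough to sketch: rescale each embedding dimension to zero mean and unit variance across the vocabulary, so the pre-trained matrix matches the scale a transformer's random initializers produce. A minimal version (my illustration of the idea, not the paper's exact procedure):

```python
import numpy as np

def standardize_embeddings(emb, eps=1e-8):
    """Per-dimension standardization of a pre-trained embedding matrix
    emb: (vocab_size, d). Each column ends up with zero mean and
    (approximately) unit variance across the vocabulary."""
    mu = emb.mean(axis=0, keepdims=True)
    sd = emb.std(axis=0, keepdims=True)
    return (emb - mu) / (sd + eps)

rng = np.random.default_rng(2)
# Pre-trained embeddings often sit far from the origin at a small scale:
raw = 5.0 + 0.01 * rng.normal(size=(1000, 16))
e = standardize_embeddings(raw)
```

After this transform the embeddings can be dropped into a transformer whose other weights use standard initialization without the scale mismatch the paper identifies.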


$\textit{GeoHard}$: Towards Measuring Class-wise Hardness through Modelling Class Semantics

http://arxiv.org/abs/2407.12512v1

Compressor summary: This paper introduces class-wise hardness and proposes GeoHard, a metric that measures the difficulty of different classes in natural language understanding tasks by analyzing their semantic embeddings.


Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations

http://arxiv.org/abs/2407.12511v1

Compressor summary: CoLIE is a novel approach that enhances low-light images by mapping coordinates to illumination components and reducing computational overhead, making it more adaptable and practical for various scenes and tasks.


MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

http://arxiv.org/abs/2407.12508v1

Compressor summary: MERLIN is a system that uses large language models to improve text-video retrieval by aligning user queries with video content.


Subequivariant Reinforcement Learning in 3D Multi-Entity Physical Environments

http://arxiv.org/abs/2407.12505v1

Compressor summary: This paper introduces Subequivariant Hierarchical Neural Networks (SHNN) for learning policies in complex 3D multi-entity systems, using local entity-level graphs and subequivariant message passing, and proposes a new benchmark (MEBEN) to evaluate them.


Case2Code: Learning Inductive Reasoning with Synthetic Data

http://arxiv.org/abs/2407.12504v1

Compressor summary: The paper proposes a Case2Code task to evaluate and teach large language models (LLMs) inductive reasoning by synthesizing input-output transformations for executable programs and training LLMs on these synthetic cases.


EmoFace: Audio-driven Emotional 3D Face Animation

http://arxiv.org/abs/2407.12501v1

Compressor summary: EmoFace is a novel audio-driven method for creating emotionally expressive 3D face animations with natural blinks, eye movements, and lip synchronization, and it comes with a new emotional dataset for MetaHuman models.


Automate or Assist? The Role of Computational Models in Identifying Gendered Discourse in US Capital Trial Transcripts

http://arxiv.org/abs/2407.12500v1

Compressor summary: The paper presents a case study of using automated systems to identify gender-biased language in US capital trials for women defendants, finding that they can help lawyers challenge their bias and refine annotation rules, but cannot replace human expertise.


Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

http://arxiv.org/abs/2407.12498v1

Compressor summary: This study evaluates how well large language models perform on a benchmark with different learning methods and shows that they improve when using image captions or interleaved data, and few-shot learning.


Test-Time Adaptation with State-Space Models

http://arxiv.org/abs/2407.12492v1

Compressor summary: The paper proposes a probabilistic model that adapts a deployed model to distribution shifts by learning hidden feature dynamics and class prototypes without labels or model backbone access.


Hierarchical and Decoupled BEV Perception Learning Framework for Autonomous Driving

http://arxiv.org/abs/2407.12491v1

Compressor summary: The paper proposes a new hierarchical BEV perception paradigm for autonomous driving systems, using a library of modules and a user-friendly interface to solve challenges like lengthy development cycles and poor reusability.


Towards AI-Powered Video Assistant Referee System for Association Football

http://arxiv.org/abs/2407.12483v1

Compressor summary: The paper introduces a semi-automated system, VARS, that improves football fairness and accuracy by analyzing multi-view videos of fouls and suggesting sanctions.


Pretraining Data and Tokenizer for Indic LLM

http://arxiv.org/abs/2407.12481v1

Compressor summary: The authors propose a data-preparation approach for a multilingual Indic large language model, drawing on open-source and proprietary sources (Common Crawl, Indic books, news articles, and Wikipedia), with a custom per-language preprocessing pipeline to remove redundant and low-quality text, deduplication of Common Crawl data, and a novel multilingual tokenizer training strategy that outperforms OpenAI's Tiktoken tokenizer for Indic languages.


A Novel Dependency Framework for Enhancing Discourse Data Analysis

http://arxiv.org/abs/2407.12473v1

Compressor summary: The study uses refined BERT-based parsers to convert PDTB and RST annotations into dependencies, enabling unified analysis of different discourse corpora across languages.


Continual Learning for Temporal-Sensitive Question Answering

http://arxiv.org/abs/2407.12470v1

Compressor summary: This paper proposes a continual learning framework for question answering with temporal memory and contrastive learning, and creates a new dataset to support the research.


Estimating Reaction Barriers with Deep Reinforcement Learning

http://arxiv.org/abs/2407.12453v1

Compressor summary: The authors propose using reinforcement learning to find the cheapest way for a system to transition between stable states in its energy landscape.


Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning

http://arxiv.org/abs/2407.12448v1

Compressor summary: EDIS improves offline-to-online RL by using a diffusion model to extract prior knowledge from offline data and energy functions to generate better online data.


A Comprehensive Sustainable Framework for Machine Learning and Artificial Intelligence

http://arxiv.org/abs/2407.12445v1

Compressor summary: The paper presents a new framework for Sustainable Machine Learning that considers fairness, privacy, interpretability, and greenhouse gas emissions, and proposes a meta-learning algorithm to help users select optimal model architectures based on their requirements.


GraphGuard: Contrastive Self-Supervised Learning for Credit-Card Fraud Detection in Multi-Relational Dynamic Graphs

http://arxiv.org/abs/2407.12440v1

Compressor summary: GraphGuard is a new method that uses graphs and self-supervised learning to detect credit card fraud better than existing methods.


Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review

http://arxiv.org/abs/2407.12438v1

Compressor summary: The study explores how semantic-aware embeddings can improve information retrieval from large, diverse, and temporal data lakes in various application domains.


Variable-Agnostic Causal Exploration for Reinforcement Learning

http://arxiv.org/abs/2407.12437v1

Compressor summary: VACERL is a framework that uses causal relationships to guide exploration in RL without assuming environmental causal variables, improving agent performance especially in challenging domains.


F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

http://arxiv.org/abs/2407.12435v1

Compressor summary: The paper introduces Semantic-HOI, a new dataset for 3D human object interaction, and proposes F-HOI, a unified model that leverages multimodal instructions to handle diverse HOI tasks.


GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval

http://arxiv.org/abs/2407.12431v1

Compressor summary: The paper proposes GLARE, a new Low-Light Image Enhancement network that uses a codebook prior derived from normal-light images and a generative module to align low-light features with normal-light latent representations, resulting in improved performance on various benchmarks and real-world data.


GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features

http://arxiv.org/abs/2407.12427v1

Compressor summary: The paper presents GeneralAD, a framework that detects and generates semantic, near-distribution, and industrial anomalies using Vision Transformers and attention-based discriminators, achieving high performance across various datasets.


Sharif-STR at SemEval-2024 Task 1: Transformer as a Regression Model for Fine-Grained Scoring of Textual Semantic Relations

http://arxiv.org/abs/2407.12426v1

Compressor summary: The paper explores using fine-tuning techniques on RoBERTa to improve semantic textual relatedness across different languages, with promising results in Latin languages but challenges in Arabic.


Navigating the Noisy Crowd: Finding Key Information for Claim Verification

http://arxiv.org/abs/2407.12425v1

Compressor summary: EACon is a framework that helps verify claims by abstracting evidence and deconstructing the claim into subclaims, improving the performance of large language models.


SafePowerGraph: Safety-aware Evaluation of Graph Neural Networks for Transmission Power Grids

http://arxiv.org/abs/2407.12421v1

Compressor summary: SafePowerGraph is a novel safety-oriented framework and benchmark for evaluating Graph Neural Networks in power-grid analysis; it integrates multiple simulators to assess GNN performance under realistic scenarios that existing benchmarks ignore, and finds self-supervised learning and graph attention architectures important for robustness.


Dirac–Bianconi Graph Neural Networks – Enabling Non-Diffusive Long-Range Graph Predictions

http://arxiv.org/abs/2407.12419v1

Compressor summary: DBGNNs are a new type of graph neural network that uses the topological Dirac equation to capture complex graph dynamics and outperforms conventional MPNNs for long-range predictions.


Improving the classification of extreme classes by means of loss regularisation and generalised beta distributions

http://arxiv.org/abs/2407.12417v1

Compressor summary: The paper proposes a unimodal regularisation method that improves classification of extreme classes in ordinal problems using the generalised beta distribution and shows superior results compared to other methods.


Not All Frequencies Are Created Equal: Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting

http://arxiv.org/abs/2407.12415v1

Compressor summary: Frequency Dynamic Fusion (FreDF) is a novel time series forecasting method that captures long-term dependency by predicting and fusing different frequencies in the Fourier domain, adapting to various scenarios.


Analyzing the Generalization and Reliability of Steering Vectors (ICML 2024)

http://arxiv.org/abs/2407.12404v1

Compressor summary: The paper investigates the reliability and generalization properties of steering vectors for language models, finding that they have limitations in terms of variable effectiveness and brittleness.
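For context, steering vectors of the kind analysed here are commonly built as mean activation differences over contrastive prompt pairs and added to a layer's hidden state at inference time. A minimal numpy sketch of that generic recipe (an illustration of the technique, not this paper's evaluation code):

```python
import numpy as np

def steering_vector(acts_pos, acts_neg):
    """Mean-difference steering vector: the average hidden activation on
    prompts exhibiting a behaviour, minus the average on contrastive
    prompts that do not."""
    return np.mean(acts_pos, axis=0) - np.mean(acts_neg, axis=0)

def apply_steering(hidden, vec, scale=1.0):
    """Steer generation by adding the scaled vector to a layer's
    hidden state before it is passed onward."""
    return hidden + scale * vec

# Tiny 2-d illustration: the vector points from the "negative" towards
# the "positive" behaviour in activation space.
vec = steering_vector(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
steered = apply_steering(np.zeros(2), vec, scale=0.5)
```

The paper's finding is precisely that how reliably such a vector transfers across inputs varies more than this simple construction suggests.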


TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

http://arxiv.org/abs/2407.12402v1

Compressor summary: The paper introduces TurkishMMLU, a multitask, multiple-choice Turkish QA benchmark to evaluate LLMs' understanding of the Turkish language across various subjects in high-school curricula.


Geometric Remove-and-Retrain (GOAR): Coordinate-Invariant eXplainable AI Assessment

http://arxiv.org/abs/2407.12401v1

Compressor summary: The paper proposes GOAR, a geometric feature-perturbation approach for XAI that overcomes the limitations of pixel-perturbation methods like ROAR.


A Practical Solver for Scalar Data Topological Simplification

http://arxiv.org/abs/2407.12399v1

Compressor summary: The paper proposes an optimization method for simplifying scalar data that preserves important features and can handle different topological structures, making it practical for real-life datasets and improving analysis and visualization.


Mamba-PTQ: Outlier Channels in Recurrent Large Language Models

http://arxiv.org/abs/2407.12397v1

Compressor summary: This paper explores how post-training quantization affects recurrent language models, identifying activation outliers as a challenge similar to transformer-based models.
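The outlier-channel symptom is easy to check for in any model: a handful of channels with a far larger activation range than the rest will dominate a per-tensor quantization grid. A hypothetical diagnostic sketch (a robust median/MAD rule of my own choosing, not the paper's procedure):

```python
import numpy as np

def find_outlier_channels(activations, z_thresh=6.0):
    """Flag channels whose absolute dynamic range sits far above the
    layer's typical range. activations: (num_tokens, num_channels).
    Uses a robust median/MAD rule so one huge channel cannot mask itself."""
    per_channel_max = np.abs(activations).max(axis=0)
    med = np.median(per_channel_max)
    mad = np.median(np.abs(per_channel_max - med)) + 1e-8
    return np.where(per_channel_max > med + z_thresh * mad)[0]

# Deterministic toy layer: 16 well-behaved channels, one huge-range one.
acts = np.ones((8, 16))
acts[:, 3] = 50.0
outliers = find_outlier_channels(acts)  # flags channel 3
```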


Efficient Depth-Guided Urban View Synthesis

http://arxiv.org/abs/2407.12395v1

Compressor summary: EDUS is a new method for fast and efficient urban view synthesis from sparse input images using noisy predicted geometric priors.


PersLLM: A Personified Training Approach for Large Language Models

http://arxiv.org/abs/2407.12393v1

Compressor summary: This study proposes PersLLM, a method to integrate psychology-grounded principles of personality into large language models, enhancing their utility in domains like social simulations and human-machine interactions.


Morphosyntactic Analysis for CHILDES

http://arxiv.org/abs/2407.12389v1

Compressor summary: The paper presents new morphosyntactic transcription and analysis methods for the CHILDES database, enabled by advances in AI and ML, that let researchers compare language development across 27 languages.


Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

http://arxiv.org/abs/2407.12383v1

Compressor summary: RECE is a novel approach that efficiently modifies text-to-image models to erase inappropriate concepts without compromising their generation ability or requiring additional fine-tuning.


Deep Learning-based Sentiment Analysis of Olympics Tweets

http://arxiv.org/abs/2407.12376v1

Compressor summary: The study develops an advanced deep learning model for sentiment analysis of Olympic-related tweets, achieving the highest accuracy with the BERT model.


FETCH: A Memory-Efficient Replay Approach for Continual Learning in Image Classification

http://arxiv.org/abs/2407.12375v1

Compressor summary: FETCH is a two-stage compression method that improves accuracy in class-incremental continual learning using compressed replay with GDumb.


HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects

http://arxiv.org/abs/2407.12371v1

Compressor summary: The HIMO dataset provides a large collection of full-body human interactions with multiple objects, along with textual descriptions and temporal segments, for training models on two novel tasks: HOI synthesis and fine-grained timeline control.


Temporal receptive field in dynamic graph learning: A comprehensive analysis

http://arxiv.org/abs/2407.12370v1

Compressor summary: This study analyzes the temporal receptive field in dynamic graph learning, showing its importance for accurate prediction in evolving networks.


NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

http://arxiv.org/abs/2407.12366v1

Compressor summary: The authors propose a method to bridge the gap between large language models and vision-and-language navigation tasks by aligning visual content in a frozen language model, enabling better integration of language and navigation policy networks.


Conversational Query Reformulation with the Guidance of Retrieved Documents

http://arxiv.org/abs/2407.12363v1

Compressor summary: GuideCQR is a framework that improves conversational search by using guided documents to refine queries, generate expected answers, and filter results, outperforming previous methods and LLM prompts.


ProcTag: Process Tagging for Assessing the Efficacy of Document Instruction Data

http://arxiv.org/abs/2407.12358v1

Compressor summary: ProcTag is a data-oriented method that evaluates document instruction datasets by tagging the execution process of instructions, enabling selective sampling or filtering for training large language models on document visual question answering tasks.


Evaluating graph-based explanations for AI-based recommender systems

http://arxiv.org/abs/2407.12357v1

Compressor summary: Graph-based explanations improve usability but not understanding for AI recommendations compared to textual explanations.


LTSim: Layout Transportation-based Similarity Measure for Evaluating Layout Generation

http://arxiv.org/abs/2407.12356v1

Compressor summary: The paper introduces a new layout similarity measure based on optimal transport, which can handle various layout differences and is applicable to many layout generation tasks.
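The transport view of layout similarity can be illustrated in miniature: treat each layout as a set of boxes and take the cheapest one-to-one matching between them, which is the balanced, equal-size special case of optimal transport (LTSim's actual formulation and cost features are in the paper). A toy sketch:

```python
import numpy as np
from itertools import permutations

def layout_distance(boxes_a, boxes_b):
    """Minimal matching-based layout distance for small, equal-size
    layouts. Boxes are [x, y, w, h]; return the lowest average pairwise
    cost over all one-to-one assignments (brute force stand-in for a
    real OT solver, which also handles layouts of unequal size)."""
    a = np.asarray(boxes_a, dtype=float)
    b = np.asarray(boxes_b, dtype=float)
    best = np.inf
    for perm in permutations(range(len(a))):
        cost = np.mean([np.linalg.norm(a[i] - b[j])
                        for i, j in enumerate(perm)])
        best = min(best, cost)
    return best

layout = [[0, 0, 1, 1], [2, 2, 1, 1]]
shifted = [[1, 0, 1, 1], [3, 2, 1, 1]]  # every box moved right by 1
```

A permutation of the same layout scores 0, which is exactly the order-invariance that makes transport-style measures attractive for layouts.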


Invertible Neural Warp for NeRF

http://arxiv.org/abs/2407.12354v1

Compressor summary: The paper presents a new method for optimizing camera pose and NeRF using overparameterized representations, rigid warp functions, and invertible neural networks.


Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

http://arxiv.org/abs/2407.12346v1

Compressor summary: The proposed object-aware query perturbation framework improves cross-modal image-text retrieval for small objects by generating a key feature subspace of the detected objects and perturbing the queries using this subspace.


VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

http://arxiv.org/abs/2407.12345v1

Compressor summary: The paper presents a trajectory prediction method that augments surround-view camera inputs with textual descriptions generated by a VLM and refined by an LLM, runs at a 53 ms latency suitable for real-time processing, outperforms prior methods of similar performance, and introduces nuScenes-Text, a new dataset with rich textual annotations.


The Better Angels of Machine Personality: How Personality Relates to LLM Safety

http://arxiv.org/abs/2407.12344v1

Compressor summary: The paper explores how personality traits affect LLM safety abilities and shows that modifying these traits can improve their performance in various aspects.


Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection

http://arxiv.org/abs/2407.12342v1

Compressor summary: WordFS is a new method to reduce word embedding dimensions while maintaining efficiency and effectiveness in various natural language processing tasks.


M2DS: Multilingual Dataset for Multi-document Summarisation

http://arxiv.org/abs/2407.12336v1

Compressor summary: The paper introduces M2DS, a multilingual multi-document summarisation dataset built from BBC articles in five languages, addressing the lack of multilingual resources for the task in today's globalized digital world.


Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition

http://arxiv.org/abs/2407.12332v1

Compressor summary: The paper explains why some models can generalize well on modular addition problem even after overfitting, by transitioning from kernel-like to gradient descent-like behavior.
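The modular-addition setup behind grokking experiments is easy to reproduce: train on a small fraction of all (a, b) pairs with label (a + b) mod p, and test on the rest. A minimal dataset generator (generic experimental setup, not the paper's code; the small training fraction is what makes delayed generalisation visible):

```python
import numpy as np

def modular_addition_dataset(p=97, train_frac=0.3, seed=0):
    """All pairs (a, b) in Z_p x Z_p with label (a + b) mod p, split
    into the small training fraction typical of grokking experiments."""
    pairs = np.array([(a, b) for a in range(p) for b in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = idx[:n_train], idx[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])
```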


I2AM: Interpreting Image-to-Image Latent Diffusion Models via Attribution Maps

http://arxiv.org/abs/2407.12331v1

Compressor summary: The paper proposes I2AM, a method to enhance interpretability of image generation models by aggregating patch-level cross-attention scores, enabling detailed attribution analysis and evaluation for reference-based image inpainting tasks.


Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset

http://arxiv.org/abs/2407.12330v1

Compressor summary: This paper proposes an energy model-based instance-wise calibration method for deep neural networks to improve their uncertainty estimation and reliability in multi-class classification tasks, especially for out-of-distribution inputs.
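For context, the energy score this line of work builds on is just the negative log-sum-exp of the classifier logits; low energy corresponds to confident, in-distribution-looking inputs. A generic sketch of that score (the paper's instance-wise scaling on top of it is its own contribution):

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Free-energy of an input under a classifier:
    E(x) = -T * logsumexp(logits / T), computed stably by subtracting
    the row maximum. Lower energy = more confident, more
    in-distribution-looking."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1)
    return -temperature * (m + np.log(np.exp(z - m[..., None]).sum(axis=-1)))

peaked = energy_score([[10.0, 0.0, 0.0]])  # confident prediction
flat = energy_score([[1.0, 1.0, 1.0]])     # maximally uncertain
```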


Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

http://arxiv.org/abs/2407.12327v1

Compressor summary: The Spectra LLM suite explores low-bitwidth language models and their performance, training dynamics, and scaling trends, with promising results in size reduction and commonsense reasoning, but challenges in toxicity and perplexity.


Out of Length Text Recognition with Sub-String Matching

http://arxiv.org/abs/2407.12317v1

Compressor summary: The paper proposes a novel method for long text recognition, called OOL Text Recognition with sub-String Matching (SMTR), which uses cross-attention and regularization training to handle arbitrary length text.


ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

http://arxiv.org/abs/2407.12315v1

Compressor summary: ModalChorus is an interactive system for visual probing and alignment of multi-modal embeddings, using a two-stage process with Modal Fusion Map and embedding alignment to enhance modality fusion and intention articulation.


MEDFuse: Multimodal EHR Data Fusion with Masked Lab-Test Modeling and Large Language Models

http://arxiv.org/abs/2407.12309v1

Compressor summary: MEDFuse is a framework that fuses structured and unstructured EHR data using multimodal embeddings to improve clinical decision-making, achieving high performance in multi-label classification tasks.


Weakly-Supervised 3D Hand Reconstruction with Knowledge Prior and Uncertainty Guidance

http://arxiv.org/abs/2407.12307v1

Compressor summary: The text introduces a weakly-supervised method for 3D hand reconstruction that uses hand knowledge from different sources and uncertainty modeling to train with 2D landmark annotations, improving performance over existing methods.


Splatfacto-W: A Nerfstudio Implementation of Gaussian Splatting for Unconstrained Photo Collections

http://arxiv.org/abs/2407.12306v1

Compressor summary: Splatfacto-W is a novel view synthesis method that improves scene consistency in unconstrained images by incorporating per-Gaussian neural color features, per-image appearance embeddings, and a spherical harmonics-based background model.


Exploiting Inter-Image Similarity Prior for Low-Bitrate Remote Sensing Image Compression

http://arxiv.org/abs/2407.12295v1

Compressor summary: The paper proposes a codebook-based remote sensing image compression method that leverages VQGAN to generate a discrete codebook and uses a Transformer-based prediction model and a hierarchical prior integration network to enhance the decoding performance.


Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection

http://arxiv.org/abs/2407.12292v1

Compressor summary: The GAKer method generates adversarial examples that can fool deep neural networks into recognizing any image as a target object, improving attack success rates for both known and unknown classes.


JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

http://arxiv.org/abs/2407.12291v1

Compressor summary: Joint Score Distillation (JSD) improves text-to-3D generation by considering coherence among views and using energy functions to capture view-aware information.


Chip Placement with Diffusion

http://arxiv.org/abs/2407.12282v1

Compressor summary: The authors propose a diffusion model and a novel architecture for macro placement in digital circuits, which achieves competitive performance and trains at scale using synthetic datasets.


ER-FSL: Experience Replay with Feature Subspace Learning for Online Continual Learning

http://arxiv.org/abs/2407.12279v1

Compressor summary: ER-FSL is a novel online continual learning method that uses multiple feature subspaces to learn new data and replays old data in a larger feature space to prevent catastrophic forgetting.


Multimodal Reranking for Knowledge-Intensive Visual Question Answering

http://arxiv.org/abs/2407.12277v1

Compressor summary: The paper proposes a multi-modal reranker for visual question answering that uses cross-item interaction to improve ranking quality and relevance score modeling of knowledge candidates.


When can transformers compositionally generalize in-context?

http://arxiv.org/abs/2407.12275v1

Compressor summary: The text discusses how transformers can compose tasks from independent components, but struggle to generalize compositionally unless there's a bottleneck separating task inference and execution.


GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity

http://arxiv.org/abs/2407.12273v1

Compressor summary: GRIDS is a new image restoration method that groups similar degradations and improves efficiency and effectiveness of restoration.


RBAD: A Dataset and Benchmark for Retinal Vessels Branching Angle Detection

http://arxiv.org/abs/2407.12271v1

Compressor summary: The paper presents a new method to accurately detect retinal branching angles using image processing and provides an open-source annotation tool and a benchmark dataset for evaluation.


UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs

http://arxiv.org/abs/2407.12269v1

Compressor summary: Unified Temporal Graph (UTG) is a framework that combines snapshot- and event-based machine learning models for temporal graphs, improving their performance and efficiency.


Generating 3D House Wireframes with Semantics

http://arxiv.org/abs/2407.12267v1

Compressor summary: The authors propose a novel autoregressive model with a unified wire-based representation, a graph-based autoencoder for learning geometric tokens, and a transformer-based decoder that iteratively predicts and decodes semantically enriched 3D house wireframes which can be segmented into components, achieving superior accuracy and novelty.


In-Context Probing Approximates Influence Function for Data Valuation

http://arxiv.org/abs/2407.12259v1

Compressor summary: In-context probing is a useful method for valuing and selecting training data for large language models, as it approximates the influence functions that estimate the contribution of data to model predictions.


Facial Affect Recognition based on Multi Architecture Encoder and Feature Fusion for the ABAW7 Challenge

http://arxiv.org/abs/2407.12258v1

Compressor summary: The paper describes a method for facial expression analysis using Transformer Encoder and visual features, achieving better results than previous methods in a competition.


Compound Expression Recognition via Multi Model Ensemble for the ABAW7 Challenge

http://arxiv.org/abs/2407.12257v1

Compressor summary: The paper proposes an ensemble learning method using different neural networks to recognize complex human emotional expressions accurately.


Dual-Hybrid Attention Network for Specular Highlight Removal

http://arxiv.org/abs/2407.12255v1

Compressor summary: The paper proposes a novel end-to-end network (DHAN-SHR) that uses hybrid attention mechanisms to remove specular highlights from images without relying on additional priors or supervision, achieving state-of-the-art performance and introducing a large-scale benchmark dataset.


COKE: Causal Discovery with Chronological Order and Expert Knowledge in High Proportion of Missing Manufacturing Data

http://arxiv.org/abs/2407.12254v1

Compressor summary: COKE is a method that uses expert knowledge and chronological order to construct causal graphs for fault diagnosis and optimization in manufacturing processes without imputing missing data, achieving significant improvement in F1-score compared to benchmark methods.


Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts

http://arxiv.org/abs/2407.12247v1

Compressor summary: The paper proposes a bidirectional RNN model for predicting missing characters in ancient manuscripts, which can help scholars rank possible reconstructions but not reconstruct the text definitively.


Explaining Deep Neural Networks by Leveraging Intrinsic Methods

http://arxiv.org/abs/2407.12243v1

Compressor summary: The thesis aims to improve interpretability of deep neural networks by introducing self-explanatory designs, studying neuron activation phenomena, and analyzing visual analytics applications.


Adaptive Cascading Network for Continual Test-Time Adaptation

http://arxiv.org/abs/2407.12240v1

Compressor summary: Our method adapts a pre-trained model to different unlabelled domains at test time by updating both feature extractor and classifier using meta-learning and cascading, with new evaluation metrics for real-world scenarios.


Motion and Structure from Event-based Normal Flow

http://arxiv.org/abs/2407.12239v1

Compressor summary: The paper presents an approach for estimating motion and structure from event-camera normal flow, a setting that is more challenging than standard cameras but advantageous in certain scenarios.


Urban Traffic Forecasting with Integrated Travel Time and Data Availability in a Conformal Graph Neural Network Framework

http://arxiv.org/abs/2407.12238v1

Compressor summary: The study proposes a new framework using Graph Neural Networks and Adaptive Conformal Prediction to improve traffic flow prediction, outperforming existing models.
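Conformal prediction of the kind used here wraps any point forecaster in intervals with guaranteed marginal coverage. A minimal split-conformal sketch (the textbook construction under an exchangeability assumption, not the paper's adaptive variant):

```python
import numpy as np

def split_conformal_interval(cal_residuals, test_pred, alpha=0.1):
    """Split conformal prediction: turn a point forecaster into one with
    >= (1 - alpha) marginal coverage, using absolute residuals from a
    held-out calibration set and a finite-sample-corrected quantile."""
    n = len(cal_residuals)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(np.abs(cal_residuals))[k - 1]   # conformal quantile
    return test_pred - q, test_pred + q

# Synthetic check: an imperfect forecaster still gets ~90% coverage.
rng = np.random.default_rng(1)
truth = rng.normal(size=2000)
pred = truth + rng.normal(scale=0.5, size=2000)
lo, hi = split_conformal_interval(pred[:1000] - truth[:1000], pred[1000:])
coverage = np.mean((truth[1000:] >= lo) & (truth[1000:] <= hi))
```

The appeal for traffic forecasting is that the guarantee holds regardless of how well-specified the underlying GNN is.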


Base Models for Parabolic Partial Differential Equations

http://arxiv.org/abs/2407.12234v1

Compressor summary: The text proposes a meta-learning framework for efficiently solving parabolic PDEs across different scenarios by learning from existing simulations.


Conditional Quantile Estimation for Uncertain Watch Time in Short-Video Recommendation

http://arxiv.org/abs/2407.12223v1

Compressor summary: The paper proposes Conditional Quantile Estimation (CQE), a novel technique that uses quantile regression to capture the nuanced distribution of watch time in short video recommendation, improving accuracy and robustness.
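Quantile regression of the kind CQE relies on trains against the pinball loss, which is minimised by the q-th conditional quantile rather than the mean, so a grid of q's traces out the whole watch-time distribution. A minimal numpy sketch (illustrating the loss, not the paper's model):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: minimised when y_pred equals the
    q-th quantile of y_true."""
    diff = np.asarray(y_true, dtype=float) - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

# On a skewed toy watch-time distribution, the median beats the mean
# as a point prediction under the q = 0.5 pinball loss.
rng = np.random.default_rng(0)
watch_times = rng.exponential(scale=30.0, size=10_000)  # seconds
loss_median = pinball_loss(watch_times, np.median(watch_times), 0.5)
loss_mean = pinball_loss(watch_times, np.mean(watch_times), 0.5)
```

At q = 0.5 the pinball loss is half the mean absolute error, which is why the median, not the mean, minimises it on skewed data like watch time.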


Questionable practices in machine learning

http://arxiv.org/abs/2407.12220v1

Compressor summary: The text discusses the prevalence of questionable research practices in evaluating large language models and their negative impact on reproducibility.