arxiv compressed, 2024-08-22

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-08-22 generated by the compressor, my personal LLM-based project.


GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

http://arxiv.org/abs/2408.11817v1

Compressor summary: GRAB is a synthetic graph analysis benchmark for testing large multimodal models' capabilities in interpreting figures and estimating properties of graphs.


Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

http://arxiv.org/abs/2408.11816v1

Compressor summary: The paper explores whether an object-centric mapping helps agents learn efficiently in reinforcement learning and proposes a hierarchical model-based algorithm that outperforms existing methods.


Great Memory, Shallow Reasoning: Limits of $k$NN-LMs

http://arxiv.org/abs/2408.11815v1

Compressor summary: $k$-nearest neighbor ($k$NN) language models perform well on memory-intensive tasks but struggle with multi-hop reasoning.
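For context, the standard $k$NN-LM recipe mixes the parametric model's next-token distribution with a distribution built from nearest-neighbor lookups in a datastore: $p(y \mid x) = \lambda\, p_{kNN}(y \mid x) + (1-\lambda)\, p_{LM}(y \mid x)$. A minimal sketch of that generic interpolation (toy distributions and names are illustrative, not this paper's evaluation setup):

```python
def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """Standard kNN-LM mixture: p(y|x) = lam * p_knn(y|x) + (1 - lam) * p_lm(y|x).

    Both inputs are dicts mapping tokens to probabilities; the result stays
    a valid distribution when both inputs are.
    """
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
            for t in vocab}

# Toy example: the datastore strongly prefers a memorized continuation,
# pulling probability mass toward it without retraining the LM.
p_lm = {"paris": 0.4, "london": 0.6}
p_knn = {"paris": 1.0}
mixed = knn_lm_interpolate(p_lm, p_knn, lam=0.25)  # paris: 0.55, london: 0.45
```

This memorization-by-lookup mechanism is exactly why such models shine on memory-intensive tasks while the reasoning itself stays as shallow as the base LM's.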


SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

http://arxiv.org/abs/2408.11813v1

Compressor summary: The text introduces Supervised Embedding Alignment (SEA), a method that improves the integration of visual and language representations in Multimodal Large Language Models (MLLMs) using contrastive learning, leading to better performance and interpretability without additional data or computation.


Pixel Is Not A Barrier: An Effective Evasion Attack for Pixel-Domain Diffusion Models

http://arxiv.org/abs/2408.11810v1

Compressor summary: The text proposes a novel attack method for image editing based on diffusion models, which can bypass previous defenses and exploit vulnerabilities in both pixel-domain and latent-domain models.


Approaching Deep Learning through the Spectral Dynamics of Weights

http://arxiv.org/abs/2408.11804v1

Compressor summary: The text proposes an empirical approach using weight dynamics in deep learning to explain various phenomena, such as optimization bias, memorizing vs. generalizing networks, and sparse subnetworks.


Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models

http://arxiv.org/abs/2408.11801v1

Compressor summary: The paper introduces Story3D-Agent, a method that uses large language models to create 3D visualizations of stories with precise control and logical reasoning.


PermitQA: A Benchmark for Retrieval Augmented Generation in Wind Siting and Permitting domain

http://arxiv.org/abs/2408.11800v1

Compressor summary: The paper introduces a framework for creating domain-specific benchmarks to evaluate Retrieval Augmented Generation (RAG) in Natural Language Processing, using Human-AI teaming and a case study on wind energy permitting.


Practical token pruning for foundation models in few-shot conversational virtual assistant systems

http://arxiv.org/abs/2408.11799v1

Compressor summary: The paper presents a fast and accurate intent classification method for enterprise VA systems using contrastive learning, multi-task adaptation, and dynamic token pruning.
Key points:
- The paper proposes a method to improve an enterprise VA system's intent classification with contrastive learning and multi-task adaptation.
- The method achieves high accuracy even with few training samples and outperforms commercial solutions.
- It also speeds up inference by pruning tokens dynamically without extra training.


LLM Pruning and Distillation in Practice: The Minitron Approach

http://arxiv.org/abs/2408.11796v1

Compressor summary: The authors compress large language models using pruning and distillation, achieving high performance on benchmarks and aligning them with NeMo Aligner for instruct-tuned use.


EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

http://arxiv.org/abs/2408.11795v1

Compressor summary: The paper introduces EE-MLLM, a multimodal language model that balances data and compute efficiency by modifying the self-attention mechanism to enable both computational and weight reuse advantages.


Leveraging Chemistry Foundation Models to Facilitate Structure Focused Retrieval Augmented Generation in Multi-Agent Workflows for Catalyst and Materials Design

http://arxiv.org/abs/2408.11793v1

Compressor summary: The text describes using large language models to predict molecular properties, generate materials, and retrieve relevant chemistry information for various tasks.


Critique-out-Loud Reward Models

http://arxiv.org/abs/2408.11791v1

Compressor summary: CLoud reward models generate critiques of assistant responses and use them to predict rewards for quality, improving preference classification accuracy and win rate in RLHF.


DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

http://arxiv.org/abs/2408.11788v1

Compressor summary: DreamFactory is a framework that uses language models and multi-agent collaboration to create long, coherent, and stylistic videos with novel evaluation metrics and a new dataset.


Timeline and Boundary Guided Diffusion Network for Video Shadow Detection

http://arxiv.org/abs/2408.11785v1

Compressor summary: The paper presents TBGDiff, a new network for detecting shadows in videos that leverages temporal and boundary cues and achieves state-of-the-art results.
Key points:
- TBGDiff is a novel network for video shadow detection that considers temporal and boundary information.
- It uses DSA to aggregate long-term and short-term frames, SBAA to attend to shadow boundaries, and Diffusion with STEE to guide the process.
- It outperforms state-of-the-art methods in experiments and is publicly available on GitHub.


Personality Alignment of Large Language Models

http://arxiv.org/abs/2408.11779v1

Compressor summary: Personality Alignment (PA) is a method to tailor large language models' responses based on individual users' behavioral preferences, using the PAPI dataset and an optimization technique that improves efficiency and relevance of AI interactions.


Sum of Squares Circuits

http://arxiv.org/abs/2408.11778v1

Compressor summary: This paper analyzes expressive generative models and introduces a novel class of probabilistic circuits called sum of squares PCs that can be exponentially more powerful than existing ones.


Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards

http://arxiv.org/abs/2408.11775v1

Compressor summary: The paper proposes a fine-tuned system using a small language model (Phi-2) to help communicate technical standards effectively by processing diverse document formats and adapting context windows.


Embedding Ordinality to Binary Loss Function for Improving Solar Flare Forecasting

http://arxiv.org/abs/2408.11768v1

Compressor summary: The paper proposes a new loss function for binary flare prediction that considers ordinal flare characteristics and improves the performance of a ResNet34-based model using magnetogram features.


SBDet: A Symmetry-Breaking Object Detector via Relaxed Rotation-Equivariance

http://arxiv.org/abs/2408.11760v1

Compressor summary: R2GConv and SBDet use a relaxed rotation-equivariant group to handle symmetry-breaking and non-rigid transformations in visual data.


MambaCSR: Dual-Interleaved Scanning for Compressed Image Super-Resolution With SSMs

http://arxiv.org/abs/2408.11758v1

Compressor summary: MambaCSR is a framework that uses dual-interleaved scanning and position-aligned cross-scale scanning to effectively restore compressed images with contextual information.


Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks

http://arxiv.org/abs/2408.11749v1

Compressor summary: The paper explores the susceptibility of multilingual LLMs to embedding inversion attacks across various languages, scripts, and language families, and identifies patterns that could help attackers improve their methods.


DH-Bench: Probing Depth and Height Perception of Large Visual-Language Models

http://arxiv.org/abs/2408.11748v1

Compressor summary: The paper evaluates the geometric comprehension of large Vision Language Models (VLMs) in depth and height perception, finding that they consistently struggle with these aspects, and introduces benchmark datasets to improve their capabilities.


Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

http://arxiv.org/abs/2408.11746v1

Compressor summary: Mixed Sparsity Training (MST) is a pretraining method for large language models that reduces computational demands by up to 75% while maintaining performance.


FocusLLM: Scaling LLM's Context by Parallel Decoding

http://arxiv.org/abs/2408.11745v1

Compressor summary: FocusLLM extends the context length of decoder-only large language models by dividing and appending chunks as prompts for parallel decoding, improving performance on long-context tasks with less training cost.


JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet

http://arxiv.org/abs/2408.11744v1

Compressor summary: The study uses a Fine-tuned Stable Diffusion Model with ControlNet (FSDMC) to refine depiction techniques from Jiehua artists and outperforms CycleGAN in style transfer.


MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

http://arxiv.org/abs/2408.11743v1

Compressor summary: MARLIN is a technique that speeds up large language model inference on GPUs by efficiently handling batched workloads with quantization and various other optimizations.


CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering

http://arxiv.org/abs/2408.11742v1

Compressor summary: The proposed CluMo method uses a novel prompt-based approach with key-key-prompt pairs and clustering to improve generalization capacity and prevent catastrophic forgetting in multimodal continual learning for vision-language models.


Clinical Insights: A Comprehensive Review of Language Models in Medicine

http://arxiv.org/abs/2408.11735v1

Compressor summary: The paper examines large language models' advancements, applications, and challenges in the healthcare sector, emphasizing clinical efficiency, ethics, data privacy, and open-source models.


Iterative Object Count Optimization for Text-to-image Diffusion Models

http://arxiv.org/abs/2408.11721v1

Compressor summary: The authors propose a method to improve text-to-image models by using a counting loss from an external model and adapting it online, which offers advantages in flexibility and accuracy.
Key points:
- Text-to-image models struggle with generating a specified number of objects.
- The proposed method optimizes image generation based on a counting loss from an out-of-the-box counting model.
- The method (i) accommodates non-derivable counting techniques, (ii) is plug-and-play, and (iii) reuses the optimized counting token for image generation.


On Learnable Parameters of Optimal and Suboptimal Deep Learning Models

http://arxiv.org/abs/2408.11720v1

Compressor summary: The text analyzes how weight patterns in deep learning models affect performance across various datasets and model architectures, finding that successful networks share similar weight statistics and distribution.


ControlCol: Controllability in Automatic Speaker Video Colorization

http://arxiv.org/abs/2408.11711v1

Compressor summary: ControlCol is an automatic video colorization system that gives users control over the process and outperforms current techniques in quality and preference.


FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

http://arxiv.org/abs/2408.11706v1

Compressor summary: FRAP is a simple approach that adapts token weights in text-to-image diffusion models to improve prompt-image alignment and authenticity, with faster latency and better realism compared to latent code optimization methods.


Supervised Representation Learning towards Generalizable Assembly State Recognition

http://arxiv.org/abs/2408.11700v1

Compressor summary: The paper proposes a representation learning approach with ISIL, a novel loss function modification, to improve assembly state recognition and robustness to execution errors.


Physics-informed Discovery of State Variables in Second-Order and Hamiltonian Systems

http://arxiv.org/abs/2408.11691v1

Compressor summary: A new method uses physical principles to improve neural network models for discovering state variables of dynamical systems.


Interpretable Long-term Action Quality Assessment

http://arxiv.org/abs/2408.11687v1

Compressor summary: The paper proposes a new method for evaluating long-term actions in videos that improves interpretability by addressing temporal skipping and using weight-score regression.


First line of defense: A robust first layer mitigates adversarial attacks

http://arxiv.org/abs/2408.11680v1

Compressor summary: The paper proposes a first layer design for neural networks that acts as an implicit adversarial noise filter, improving robustness to adversarial attacks without the need for additional training or computation.


Exploring Robustness of Visual State Space model against Backdoor Attacks

http://arxiv.org/abs/2408.11679v1

Compressor summary: The paper explores how the state space model mechanism in Visual State Space Model affects its robustness against backdoor attacks and suggests that adding a recurrent backdoor makes it more resilient to patch perturbations.


Macformer: Transformer with Random Maclaurin Feature Attention

http://arxiv.org/abs/2408.11656v1

Compressor summary: Macformer is a Transformer architecture that uses random Maclaurin features to approximate various dot-product kernels, speeding up attention computations for long sequences.


Video-to-Text Pedestrian Monitoring (VTPM): Leveraging Computer Vision and Large Language Models for Privacy-Preserve Pedestrian Activity Monitoring at Intersections

http://arxiv.org/abs/2408.11649v1

Compressor summary: The paper proposes Video-to-Text Pedestrian Monitoring (VTPM), which generates real-time textual reports on pedestrian movements at intersections, preserving privacy and enhancing safety analysis.


Optimizing Interpretable Decision Tree Policies for Reinforcement Learning

http://arxiv.org/abs/2408.11632v1

Compressor summary: The paper introduces Decision Tree Policy Optimization (DTPO), an algorithm that directly optimizes complete decision trees using policy gradients, making them interpretable alternatives to neural networks in reinforcement learning settings.


A Markovian Model for Learning-to-Optimize

http://arxiv.org/abs/2408.11629v1

Compressor summary: The paper proposes a probabilistic model for stochastic iterative algorithms that can learn their optimal parameters and predict their convergence rate and time based on empirical data.


Annealed Sinkhorn for Optimal Transport: convergence, regularization path and debiasing

http://arxiv.org/abs/2408.11620v1

Compressor summary: Annealed Sinkhorn, a variant of optimal transport solver, has conditions for convergence and can be improved by Debiased Annealed Sinkhorn to achieve faster annealing schedules.
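For context, "annealing" here means running Sinkhorn's matrix-scaling iterations while shrinking the entropic regularization parameter over time, so the plan sharpens toward the unregularized optimum. A generic NumPy sketch under that reading (the schedule and function are illustrative, not the paper's analyzed algorithm or its debiased variant):

```python
import numpy as np

def annealed_sinkhorn(C, a, b, eps_schedule):
    """Entropic optimal transport via Sinkhorn iterations with a shrinking
    regularization (annealing) schedule.

    C: cost matrix, a/b: source/target marginals,
    eps_schedule: one epsilon value per iteration, typically decreasing.
    """
    u = np.ones_like(a)
    v = np.ones_like(b)
    for eps in eps_schedule:
        K = np.exp(-C / eps)      # Gibbs kernel at the current temperature
        v = b / (K.T @ u)         # scale columns to match marginal b
        u = a / (K @ v)           # scale rows to match marginal a
    K = np.exp(-C / eps_schedule[-1])
    return u[:, None] * K * v[None, :]

# Toy problem: matching two uniform marginals; as epsilon anneals toward 0
# the plan concentrates on the cheap (diagonal) assignments.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = b = np.array([0.5, 0.5])
P = annealed_sinkhorn(C, a, b, eps_schedule=[1.0 / (t + 1) for t in range(50)])
```

The paper's contribution is precisely about which such schedules converge and how debiasing permits faster ones; this sketch only shows the mechanics being scheduled.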


Xinyu: An Efficient LLM-based System for Commentary Generation

http://arxiv.org/abs/2408.11609v1

Compressor summary: Xinyu is an LLM-based system that helps Chinese commentators create well-structured and logically consistent narratives by deconstructing the generation process into sequential steps and addressing advanced requirements with argument ranking, a comprehensive evidence database, and retrieval augmented generation technology.


Don't Kill the Baby: The Case for AI in Arbitration

http://arxiv.org/abs/2408.11608v1

Compressor summary: This article discusses the use of AI in arbitration, arguing that parties can choose AI arbitrators if they agree, and that this approach could improve efficiency, fairness, and flexibility in legal disputes.


Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning

http://arxiv.org/abs/2408.11599v1

Compressor summary: The paper introduces a new approach to generate empathetic responses by considering emotions and their causes, using a Chain-of-Thought prompt and external knowledge, and shows its effectiveness on LLaMA-7b.


Improving Calibration by Relating Focal Loss, Temperature Scaling, and Properness

http://arxiv.org/abs/2408.11598v1

Compressor summary: Focal loss training improves classifier calibration by raising confidence on training data and decomposing into a proper loss and a confidence-raising transformation.
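For context, focal loss down-weights examples the model already classifies confidently, which is the mechanism behind the confidence-raising effect the paper decomposes. A minimal sketch of the standard formula, $\mathrm{FL}(p) = -(1-p)^{\gamma}\log p$, compared against plain cross-entropy (an illustration of the textbook definitions, not the paper's decomposition):

```python
import math

def focal_loss(p_correct, gamma=2.0):
    """Focal loss on the probability assigned to the true class:
    FL(p) = -(1 - p)**gamma * log(p). With gamma = 0 it reduces to
    ordinary cross-entropy."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

def cross_entropy(p_correct):
    return -math.log(p_correct)

# Confident correct predictions incur far less loss than under cross-entropy,
# so training pushes confidence higher on easy examples.
for p in (0.6, 0.9, 0.99):
    assert focal_loss(p) < cross_entropy(p)
```

Temperature scaling then post-hoc rescales logits by a scalar $T$, which the paper relates to undoing exactly this kind of confidence transformation.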


Toward Enhancing Vehicle Color Recognition in Adverse Conditions: A Dataset and Benchmark

http://arxiv.org/abs/2408.11589v1

Compressor summary: This paper introduces a new dataset for vehicle color recognition and shows that nighttime scenes are challenging for current models.


Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

http://arxiv.org/abs/2408.11587v1

Compressor summary: The paper introduces EST-Bad, a novel and effective method for backdoor attacks on NLP systems using large language models, which hides the malicious trigger in the data more effectively than previous methods.


Drama Engine: A Framework for Narrative Agents

http://arxiv.org/abs/2408.11574v1

Compressor summary: The Drama Engine is a framework that uses multi-agent principles to create context-aware companions for narrative purposes with features like companion development, mood systems, and automatic context summarizing.


CHOTA: A Higher Order Accuracy Metric for Cell Tracking

http://arxiv.org/abs/2408.11571v1

Compressor summary: The CHOTA metric measures all aspects of cell tracking, including global coherence, and improves biological analysis by unifying the evaluation of cell detections, local associations, and lineage tracking.


AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

http://arxiv.org/abs/2408.11564v1

Compressor summary: AutoDirector is an AI framework that helps create realistic multi-sensory films by scheduling production steps, supporting interactive tasks, and improving user feedback.


Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control

http://arxiv.org/abs/2408.11561v1

Compressor summary: The Iterative Refinement Process (IRP) is a robust anomaly detection method that improves defect detection accuracy by removing misleading data points and outperforms traditional models in noisy environments.


Differentiating Choices via Commonality for Multiple-Choice Question Answering

http://arxiv.org/abs/2408.11554v1

Compressor summary: DCQA is a novel MCQA model that leverages token-level attention and semantic commonalities among choices to differentiate them and provide justifications for choosing the correct answer.


AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

http://arxiv.org/abs/2408.11553v1

Compressor summary: The paper introduces an extended dataset for human generation with diverse clothing items and backgrounds, and proposes AnyDesign, a diffusion-based method for mask-free fashion image editing that uses Fashion DiT with Fashion-Guidance Attention to fuse apparel types and features.


Explainable Deep Learning Framework for Human Activity Recognition

http://arxiv.org/abs/2408.11552v1

Compressor summary: The paper proposes a new approach to improve human activity recognition models by using data augmentation techniques that provide clear and trustworthy explanations for their decisions.


Memorization In In-Context Learning

http://arxiv.org/abs/2408.11546v1

Compressor summary: This study finds that in-context learning improves large language model performance by surfacing memorized training data, with the effectiveness of this strategy depending on the level of memorization and the presence or absence of labels.


Evolution of Detection Performance throughout the Online Lifespan of Synthetic Images

http://arxiv.org/abs/2408.11541v1

Compressor summary: The paper studies how synthetic images change over time online and how current detectors struggle to tell them apart, proposing a method to improve their performance.


DeRainGS: Gaussian Splatting for Enhanced Scene Reconstruction in Rainy

http://arxiv.org/abs/2408.11540v1

Compressor summary: The study introduces 3DRRE, a novel task for reconstructing 3D scenes under rainy conditions, and presents DeRainGS, the first method tailored for this task, which performs better than existing methods.


Just Project! Multi-Channel Despeckling, the Easy Way

http://arxiv.org/abs/2408.11531v1

Compressor summary: MuChaPro is a framework that uses existing single-channel despeckling methods to improve multi-channel SAR images, with potential for self-supervised learning.


The Vizier Gaussian Process Bandit Algorithm

http://arxiv.org/abs/2408.11527v1

Compressor summary: Google Vizier is a successful Bayesian optimization service that has improved over time and performs well on various benchmarks.


RConE: Rough Cone Embedding for Multi-Hop Logical Query Answering on Multi-Modal Knowledge Graphs

http://arxiv.org/abs/2408.11526v1

Compressor summary: RConE is a novel embedding method for logical multi-hop question answering in Multi-Modal Knowledge Graphs, capturing sub-entities of multi-modal entities as answers.


EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

http://arxiv.org/abs/2408.11518v1

Compressor summary: EmoFace is a novel 3D virtual human model that uses Mesh Attention and self-growing training to generate realistic facial animations with emotions, overcoming data limitations and outperforming existing methods.


Imagining from Images with an AI Storytelling Tool

http://arxiv.org/abs/2408.11517v1

Compressor summary: The paper presents ImageTeller, a tool that uses GPT-4o to generate narratives from images or image sequences in various genres, allowing user interaction and influence on the story.


Quantifying Behavioural Distance Between Mathematical Expressions

http://arxiv.org/abs/2408.11515v1

Compressor summary: The paper introduces a new measure, BED, that clusters similar expressions with errors and improves the smoothness of the error landscape in symbolic regression.


Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

http://arxiv.org/abs/2408.11513v1

Compressor summary: The PDR-ANPG algorithm learns constrained Markov decision processes using entropy and quadratic regularizers, achieving improved sample complexity and last-iterate guarantees compared to previous methods.


IKUN for WMT24 General MT Task: LLMs Are here for Multilingual Machine Translation

http://arxiv.org/abs/2408.11512v1

Compressor summary: The paper presents two multilingual machine translation systems, IKUN and IKUN-C, which use large language models pre-trained on monolingual data and fine-tuned on parallel data, achieving high rankings in WMT24.


MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

http://arxiv.org/abs/2408.11505v1

Compressor summary: MSCPT is a method that uses frozen large language models to generate multi-scale pathological visual language prior knowledge for few-shot weakly supervised whole slide image classification.


Slicing Input Features to Accelerate Deep Learning: A Case Study with Graph Neural Networks

http://arxiv.org/abs/2408.11500v1

Compressor summary: SliceGCN is a distributed graph learning method that slices node features to improve scalability and efficiency on large graphs by reducing memory consumption and communication overhead.


Mutagenesis screen to map the functionals of parameters of Large Language Models

http://arxiv.org/abs/2408.11494v1

Compressor summary: The study used a biological technique called mutagenesis screen to explore how different parameters in large language models affect their performance on various tasks, revealing complex relationships and potential ways to improve them.


XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays

http://arxiv.org/abs/2408.11493v1

Compressor summary: The study applies a cross-disease transferability framework to chest X-rays, aiming to use models trained on one pulmonary disease to diagnose another disease with similar implications for resource-limited settings and emerging diseases.


Estimating Peer Direct and Indirect Effects in Observational Network Data

http://arxiv.org/abs/2408.11492v1

Compressor summary: The paper proposes a method to estimate causal effects involving peer interactions using attention mechanisms and multi-layer graph neural networks, considering both direct and indirect effects, and incorporating structural information to enhance the model's performance.


Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

http://arxiv.org/abs/2408.11491v1

Compressor summary: SCANS is a method to improve safety alignment in large language models by steering activations and hidden state transitions, achieving better performance while avoiding overly cautious rejections.


DocTabQA: Answering Questions from Long Documents Using Tables

http://arxiv.org/abs/2408.11490v1

Compressor summary: The paper introduces DocTabQA, a new question answering task that uses structured tables derived from documents to answer questions, and presents DocTabTalk, a two-stage framework that improves GPT-4's performance on this task.


E-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment

http://arxiv.org/abs/2408.11481v1

Compressor summary: E-Bench is a benchmark suite for evaluating text-driven video editing quality based on human perception, including a database with various videos, edits, and annotators' scores, as well as a new assessment network that aligns better with human preferences.


Learning Deep Dissipative Dynamics

http://arxiv.org/abs/2408.11479v1

Compressor summary: The study presents a method to transform neural networks into dissipative dynamical systems, ensuring stability, input-output stability, and energy conservation for various applications like robotic arms and fluid dynamics.


LAKD-Activation Mapping Distillation Based on Local Learning

http://arxiv.org/abs/2408.11478v1

Compressor summary: LAKD is a novel knowledge distillation framework that efficiently utilizes distilled information from teacher networks, achieving higher interpretability and competitive performance on various image classification tasks.


TrackGo: A Flexible and Efficient Method for Controllable Video Generation

http://arxiv.org/abs/2408.11475v1

Compressor summary: TrackGo is a novel method for controlling video generation using free-form masks and arrows, enhanced by the TrackAdapter that integrates into temporal self-attention layers of a pretrained model, achieving state-of-the-art results.


The Self-Contained Negation Test Set

http://arxiv.org/abs/2408.11469v1

Compressor summary: The authors propose an improved version of a test to evaluate how well pretrained language models handle negation, and find that most models still struggle with it.


MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation

http://arxiv.org/abs/2408.11465v1

Compressor summary: MeTTA is a test-time adaptation method for 3D reconstruction from single view images that uses generative prior, joint optimization, and learnable virtual cameras to handle out-of-distribution cases and achieve realistic appearance with physically based rendering.


Low-Light Object Tracking: A Benchmark

http://arxiv.org/abs/2408.11463v1

Compressor summary: The text introduces LLOT, a benchmark dataset for low-light object tracking, and proposes H-DCPT, a novel tracker that performs better than existing methods in such conditions.


Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

http://arxiv.org/abs/2408.11457v1

Compressor summary: The text describes the creation and evaluation of a new Emakhuwa translation dataset for low-resource languages, highlighting challenges and data availability.


Using Part-based Representations for Explainable Deep Reinforcement Learning

http://arxiv.org/abs/2408.11455v1

Compressor summary: The paper proposes a non-negative training method for actor models in deep reinforcement learning, enabling more interpretable part-based representations while respecting non-negative constraints.


Bidirectional Gated Mamba for Sequential Recommendation

http://arxiv.org/abs/2408.11451v1

Compressor summary: The text introduces SIGMA, a framework that improves sequential recommendations by integrating a bidirectional Partially Flipped Mamba model with a Dense Selective Gate and a Feature Extract GRU.


Enabling Small Models for Zero-Shot Classification through Model Label Learning

http://arxiv.org/abs/2408.11449v1

Compressor summary: The paper proposes Model Label Learning (MLL), a method to leverage expert models' functionalities for zero-shot tasks by using a Semantic Directed Acyclic Graph (SDAG) and an algorithm to select suitable models from a model hub.


Lookism: The overlooked bias in computer vision

http://arxiv.org/abs/2408.11448v1

Compressor summary: The paper discusses how lookism, or bias based on physical appearance, is a significant and under-explored issue in computer vision and AI technologies, and calls for systematic study and development of equitable systems.


GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

http://arxiv.org/abs/2408.11447v1

Compressor summary: GaussianOcc is a fast and efficient method for self-supervised 3D occupancy estimation using Gaussian splatting techniques for projection and rendering.


Distributional Properties of Subword Regularization

http://arxiv.org/abs/2408.11443v1

Compressor summary: Subword regularization improves NLP models but is biased towards certain tokenizations; a new algorithm improves quality by sampling tokenizations more uniformly.


Epistemic Injustice in Generative AI

http://arxiv.org/abs/2408.11441v1

Compressor summary: The paper discusses how generative AI can harm collective knowledge and democracy by causing various types of epistemic injustice and suggests ways to design fairer AI systems.


LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems

http://arxiv.org/abs/2408.11440v1

Compressor summary: LAHAJA is a diverse benchmark dataset for Hindi ASR that reveals poor performance of existing models and highlights the need for multilingual training and better speaker representation.


BAdd: Bias Mitigation through Bias Addition

http://arxiv.org/abs/2408.11439v1

Compressor summary: BAdd is a method for learning fair representations in computer vision datasets that handles both single- and multi-attribute biases by incorporating features related to these attributes into the model.


DABench: A Benchmark Dataset for Data-Driven Weather Data Assimilation

http://arxiv.org/abs/2408.11438v1

Compressor summary: The paper introduces DABench, a benchmark dataset for data-driven data assimilation models in weather prediction, with a strong baseline called DaT that combines 4D variational DA with a Transformer model.
Key points:
- Large weather models (LWMs) are deep learning models that still rely on NWP inputs and are not yet autonomous.
- DABench is a benchmark dataset for data-driven DA models that uses ERA5 as ground truth.
- DABench provides four standard features to guide and evaluate data-driven weather prediction systems.
- DaT is a strong baseline that integrates 4D variational DA into a Transformer model.


Towards Aligned Data Removal via Twin Machine Unlearning

http://arxiv.org/abs/2408.11433v1

Compressor summary: The Twin Machine Unlearning (TMU) technique aligns the unlearned model with the original model while removing data without affecting its accuracy.


T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

http://arxiv.org/abs/2408.11432v1

Compressor summary: T2VIndexer is a generative model that quickly generates video identifiers to improve the efficiency of text-video retrieval while maintaining high accuracy.


Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning

http://arxiv.org/abs/2408.11431v1

Compressor summary: LaMer is a label-free framework that diagnoses and remedies knowledge gaps in large language models using curricular meaningful learning and relative entropy.


EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

http://arxiv.org/abs/2408.11424v1

Compressor summary: The paper presents EMO-LLaMA, a multimodal large language model that incorporates facial priors and age-gender-race attributes to improve facial expression recognition in emotional AI.


Towards "Differential AI Psychology" and in-context Value-driven Statement Alignment with Moral Foundations Theory

http://arxiv.org/abs/2408.11415v1

Compressor summary: The text discusses the challenges of using language models to generate politically nuanced content and suggests a framework for improving these representations.


Pano2Room: Novel View Synthesis from a Single Indoor Panorama

http://arxiv.org/abs/2408.11413v1

Compressor summary: The paper presents Pano2Room, a method that reconstructs high-quality 3D indoor scenes from a single panoramic image using a panoramic RGBD inpainter and a 3D Gaussian Splatting field.


Linear-time One-Class Classification with Repeated Element-wise Folding

http://arxiv.org/abs/2408.11412v1

Compressor summary: REF is a linear-time, easy-to-use one-class classification method that performs well on various benchmark datasets and provides robust default settings.


SelfDRSC++: Self-Supervised Learning for Dual Reversed Rolling Shutter Correction

http://arxiv.org/abs/2408.11411v1

Compressor summary: The paper proposes SelfDRSC++, a self-supervised learning framework for correcting rolling shutter distortion using dual reversed images, which can also generate high framerate global shutter videos from low framerate rolling shutter videos.


Domain-invariant Progressive Knowledge Distillation for UAV-based Object Detection

http://arxiv.org/abs/2408.11407v1

Compressor summary: The paper proposes a new knowledge distillation framework for UAV-based object detection, addressing the feature gap and background complexity issues to improve efficiency and performance.


Video Diffusion Models are Strong Video Inpainter

http://arxiv.org/abs/2408.11402v1

Compressor summary: The paper proposes a new method called FFF-VDI that uses image-to-video diffusion models to improve video inpainting by addressing noise and time consistency issues.


Revisiting FunnyBirds evaluation framework for prototypical parts networks

http://arxiv.org/abs/2408.11401v1

Compressor summary: The authors compare two visualizations for ProtoPNet explanations and suggest using similarity maps instead of bounding boxes to better align with the network's purpose.


EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning

http://arxiv.org/abs/2408.11397v1

Compressor summary: EAGLE is a novel framework that uses visual enhancement to improve geometric reasoning in large language models by leveraging both CLIP and LLM features.


MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

http://arxiv.org/abs/2408.11396v1

Compressor summary: MoE-LPR is a two-stage method to enhance multilingual capabilities in LLMs without forgetting original language proficiency by using Mixture-of-Experts and Language Priors Routing.


First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models

http://arxiv.org/abs/2408.11393v1

Compressor summary: TDA is a training-free method that improves the inference efficiency of large language models by exploiting their sparsity without compromising performance, and it reveals two critical features of LLM sparsity.


Fairness measures for biometric quality assessment

http://arxiv.org/abs/2408.11392v1

Compressor summary: The text discusses the importance of developing fairness measures for quality assessment algorithms in biometric systems to ensure equal performance across all individuals regardless of demographic characteristics.


Data-Centric Machine Learning for Earth Observation: Necessary and Sufficient Features

http://arxiv.org/abs/2408.11384v1

Compressor summary: This work uses model explanation methods to identify crucial features and reduce data usage for temporal multimodal geospatial machine learning models.


On the Interchangeability of Positional Embeddings in Multilingual Neural Machine Translation Models

http://arxiv.org/abs/2408.11382v1

Compressor summary: The text shows that absolute positional embeddings in multilingual neural machine translation models can be swapped for relative ones with minimal or no performance loss, while keeping the encoder-decoder architecture unchanged.


RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

http://arxiv.org/abs/2408.11381v1

Compressor summary: RAGLAB is an open-source library that allows researchers to compare and create Retrieval Augmented Generation (RAG) algorithms for large language models.


A Unified Framework for Continual Learning and Machine Unlearning

http://arxiv.org/abs/2408.11374v1

Compressor summary: The paper presents a novel framework that tackles continual learning and machine unlearning together using controlled knowledge distillation with a memory buffer, achieving good results on benchmark datasets.


Solving Decision Theory Problems with Probabilistic Answer Set Programming

http://arxiv.org/abs/2408.11371v1

Compressor summary: The paper presents a new approach to solve decision theory problems using Probabilistic Answer Set Programming and an efficient algorithm based on Algebraic Model Counting.


Graph Classification via Reference Distribution Learning: Theory and Practice

http://arxiv.org/abs/2408.11370v1

Compressor summary: GRDL is a new efficient and accurate graph classification method that uses node embeddings as discrete distributions and outperforms existing methods with global pooling operations.


Towards Probabilistic Inductive Logic Programming with Neurosymbolic Inference and Relaxation

http://arxiv.org/abs/2408.11367v1

Compressor summary: Propper is an improved inductive logic programming method that learns from probabilistic and flawed background knowledge using neurosymbolic inference, BCE, and NoisyCombo, achieving better results for relational patterns in noisy images than binary ILP and Graph Neural Networks.


GeoReasoner: Reasoning On Geospatially Grounded Context For Natural Language Understanding

http://arxiv.org/abs/2408.11366v1

Compressor summary: GeoReasoner is a language model that leverages Large Language Models and geospatial information to improve reasoning on geospatially grounded natural language, outperforming existing methods in three tasks.


Current Status and Trends in Image Anti-Forensics Research: A Bibliometric Analysis

http://arxiv.org/abs/2408.11365v1

Compressor summary: This paper uses bibliometric analysis of publications in the Web of Science database to review image anti-forensics research, which studies techniques for evading forensic detection of image manipulation, such as tampered human faces.


ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

http://arxiv.org/abs/2408.11363v1

Compressor summary: ProteinGPT is a chatbot that uses GPT-4o to analyze protein sequences and structures and provide relevant answers to users.


Hypergraph Learning based Recommender System for Anomaly Detection, Control and Optimization

http://arxiv.org/abs/2408.11359v1

Compressor summary: The text presents a self-adapting anomaly detection framework that learns hypergraph structure and temporal trends in multisensor data, predicts anomalies based on forecast error, and provides root cause analysis and recommendations for remediation.


HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

http://arxiv.org/abs/2408.11357v1

Compressor summary: The paper presents a semantic-aware diffusion model and dual-representation decoupling framework that generate physically-layered 3D clothed humans from text prompts, separating clothing from the body to enable reusable, complex clothing matching across different body shapes.


One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning

http://arxiv.org/abs/2408.11356v1

Compressor summary: LigPose is a multi-task geometric deep learning model that accurately predicts protein-ligand binding without using docking tools, and shows promise for AI-based drug development.


Image Score: Learning and Evaluating Human Preferences for Mercari Search

http://arxiv.org/abs/2408.11349v1

Compressor summary: The authors propose and test a cost-efficient approach using a large language model to assess image quality in e-commerce settings, achieving a significant increase in sales on Mercari's web platform.


Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments

http://arxiv.org/abs/2408.11347v1

Compressor summary: The paper presents a 3D simulator that creates artificial video data and a QA dataset for measuring an Embodied AI agent's comprehension of human behavior and dynamic daily life in a home environment.


Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

http://arxiv.org/abs/2408.11344v1

Compressor summary: The text discusses using transformer models for generating radiology reports from chest X-rays, comparing them to LSTM models, and suggesting new evaluation methods that consider both language and classification metrics.


Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond

http://arxiv.org/abs/2408.11338v1

Compressor summary: Automatic Dataset Construction (ADC) is a method that uses large language models and code generation to create image classification datasets with less manual annotation, but faces challenges such as label errors and imbalanced data distributions.


FATE: Focal-modulated Attention Encoder for Temperature Prediction

http://arxiv.org/abs/2408.11336v1

Compressor summary: The paper introduces a new framework, FATE, that uses the FocalNet Transformer architecture to improve temperature forecasting and climate change analysis by capturing complex patterns in meteorological data.


BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports

http://arxiv.org/abs/2408.11334v1

Compressor summary: The study develops an in-house LLM that matches GPT-4's performance in extracting clinical information from unstructured breast ultrasound radiology reports, with cost and privacy benefits.


Design Principle Transfer in Neural Architecture Search via Large Language Models

http://arxiv.org/abs/2408.11330v1

Compressor summary: LAPT is a novel transfer paradigm for TNAS that uses a large language model to learn design principles from existing architectures, refine them, and reduce the search space, leading to improved efficiency and performance.


Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

http://arxiv.org/abs/2408.11327v1

Compressor summary: The paper proposes a zero-shot ensembling strategy to integrate different models for multimodal tasks, such as machine translation and image processing, by re-ranking beams during decoding.


Automating Thought of Search: A Journey Towards Soundness and Completeness

http://arxiv.org/abs/2408.11326v1

Compressor summary: AutoToS is a method that automates planning problem solving by generating search components using language models, achieving 100% accuracy without human intervention.


Optimizing Transmit Field Inhomogeneity of Parallel RF Transmit Design in 7T MRI using Deep Learning

http://arxiv.org/abs/2408.11323v1

Compressor summary: The study proposes a novel deep learning-based method to improve B1+ field homogeneity in ultrahigh field MRI, leading to faster and better image quality.


Towards Evaluating Large Language Models on Sarcasm Understanding

http://arxiv.org/abs/2408.11319v1

Compressor summary: This paper evaluates 11 LLMs and 8 PLMs on six sarcasm benchmark datasets and finds that LLMs underperform PLMs, with GPT-4 being the best performer, and few-shot IO prompting being the most effective method.


Probabilistic Medical Predictions of Large Language Models

http://arxiv.org/abs/2408.11316v1

Compressor summary: The text discusses the challenges of using large language models for clinical predictions due to their unreliable prediction probabilities, which are essential for decision-making, and suggests more research is needed.


Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer

http://arxiv.org/abs/2408.11313v1

Compressor summary: ECLIPSE is a novel black-box jailbreaking method that uses natural language instructions and LLM self-reflection to generate adversarial suffixes for malicious queries with high efficiency.


Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework

http://arxiv.org/abs/2408.11312v1

Compressor summary: The paper proposes a new framework for visual geo-localization using multiple Large Vision-Language Models that communicate with each other and learn dynamic patterns to improve performance, showing better results than existing methods on a new dataset called GeoGlobe.


Improving Out-of-Distribution Data Handling and Corruption Resistance via Modern Hopfield Networks

http://arxiv.org/abs/2408.11309v1

Compressor summary: The study proposes using Modern Hopfield Networks to improve computer vision models' robustness against minor perturbations like blurring, achieving state-of-the-art results on MNIST-C dataset.


EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models

http://arxiv.org/abs/2408.11308v1

Compressor summary: EEG-Defender is a novel defense approach that uses early transformer outputs to detect and stop malicious inputs in large language models.


KAN4TSF: Are KAN and KAN-based models Effective for Time Series Forecasting?

http://arxiv.org/abs/2408.11306v1

Compressor summary: The paper introduces the Kolmogorov-Arnold Network (KAN) for time series forecasting, which improves on existing methods by having better mathematical properties and interpretability.


UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation

http://arxiv.org/abs/2408.11305v1

Compressor summary: UniFashion is a unified framework that handles multimodal generation and retrieval tasks in the fashion domain by integrating image generation with retrieval tasks and text generation tasks, achieving superior performance compared to previous single-task models.


Koopman AutoEncoder via Singular Value Decomposition for Data-Driven Long-Term Prediction

http://arxiv.org/abs/2408.11303v1

Compressor summary: The paper proposes using SVD of the Koopman matrix to improve long-term prediction in nonlinear dynamics modeling by adjusting singular values, which affect eigenvalues and thus forecasting performance.
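The general mechanics (independent of the paper's autoencoder architecture) can be sketched with a plain least-squares Koopman operator estimate: fit a linear map between snapshot matrices, take its SVD, and zero out noise-level singular values before rolling out a long-horizon prediction. The toy dynamics and threshold below are invented for illustration.

```python
import numpy as np

# Snapshot matrices: columns are successive states of a toy linear system.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
x0 = rng.normal(size=(2, 1))
states = [x0]
for _ in range(20):
    states.append(A_true @ states[-1])
S = np.hstack(states)
X, Y = S[:, :-1], S[:, 1:]

# Least-squares Koopman estimate, then SVD-based singular-value adjustment.
K = Y @ np.linalg.pinv(X)
U, s, Vt = np.linalg.svd(K)
s_adj = np.where(s < 1e-8, 0.0, s)     # drop noise-level singular values
K_adj = U @ np.diag(s_adj) @ Vt

# Long-horizon rollout with the adjusted operator.
pred = np.linalg.matrix_power(K_adj, 20) @ states[0]
print(np.allclose(pred, states[20], atol=1e-6))
```

On clean linear data the estimate recovers the true operator, so the 20-step rollout matches the trajectory; the singular-value adjustment matters when snapshots are noisy and small singular directions amplify error over long horizons.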


Modeling Reference-dependent Choices with Graph Neural Networks

http://arxiv.org/abs/2408.11302v1

Compressor summary: The authors propose ArcRec, a deep learning framework for modeling reference-dependent preferences in recommender systems, using historical purchase records, attribute-level networks, and a novel utility function that considers interest and price sensitivity.


Offline Policy Learning via Skill-step Abstraction for Long-horizon Goal-Conditioned Tasks

http://arxiv.org/abs/2408.11300v1

Compressor summary: The paper proposes an offline method for learning goal-conditioned policies that use skills and abstraction to handle long-horizon goals with distribution shifts.


Making Large Vision Language Models to be Good Few-shot Learners

http://arxiv.org/abs/2408.11297v1

Compressor summary: This paper explores the challenges of using large vision language models (LVLMs) in few-shot classification tasks and proposes a meta-learning approach with label augmentation and candidate selection to improve their performance.


RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining

http://arxiv.org/abs/2408.11294v1

Compressor summary: RedWhale is a Korean-specific NLP model that uses continual pretraining and cross-lingual transfer learning to improve accuracy and comprehension while reducing training time and costs.


Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks

http://arxiv.org/abs/2408.11288v1

Compressor summary: Large language models show promise for mental health care but need more rigorous evaluation and ethical considerations before being widely used in clinical settings.


Taming Generative Diffusion for Universal Blind Image Restoration

http://arxiv.org/abs/2408.11287v1

Compressor summary: BIR-D is a universal blind image restoration method that uses diffusion models without assuming a fixed degradation model, dynamically updating guidance parameters to outperform existing methods on real and synthetic datasets and to handle multiple, complex degradations.


BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation

http://arxiv.org/abs/2408.11281v1

Compressor summary: BearLLM is a novel framework that uses large language models and vibration signals to manage bearing health by processing user prompts and unifying different tasks.


On Missing Scores in Evolving Multibiometric Systems

http://arxiv.org/abs/2408.11271v1

Compressor summary: The text discusses improving biometric system accuracy by filling in missing scores using various score imputation methods and simple sum fusion, especially for high proportions of missing data.
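A minimal sketch of the simplest pipeline the summary mentions: mean-impute each matcher's missing scores, then fuse by a simple sum. The score matrix is invented for illustration; the paper compares several imputation methods beyond the per-column mean used here.

```python
import numpy as np

# Rows: probe-gallery comparisons; columns: matchers. NaN marks a missing score.
scores = np.array([
    [0.90, 0.85, np.nan],
    [0.20, np.nan, 0.30],
    [0.80, 0.75, 0.70],
])

# Mean imputation: fill each matcher's missing scores with its observed mean.
col_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), col_means, scores)

# Simple sum fusion across matchers.
fused = imputed.sum(axis=1)
print(fused)
```

After imputation no comparison is dropped for having an incomplete score vector, which is the practical benefit when a large proportion of scores is missing.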


Inverting the Leverage Score Gradient: An Efficient Approximate Newton Method

http://arxiv.org/abs/2408.11267v1

Compressor summary: The paper proposes an iterative algorithm to recover model parameters using leverage scores gradient, improving data privacy and security, with low time complexity.


Practical Aspects on Solving Differential Equations Using Deep Learning: A Primer

http://arxiv.org/abs/2408.11266v1

Compressor summary: This paper explains deep learning and the Deep Galerkin method for solving differential equations, providing step-by-step examples and code snippets.
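The core idea the primer builds on — minimize the squared residual of a differential equation at sampled collocation points — can be shown in a toy analogue where a polynomial ansatz stands in for the neural network, so the fit reduces to linear least squares instead of SGD. The ODE, degree, and grid below are chosen for illustration only.

```python
import numpy as np

# Toy analogue of the Deep Galerkin idea: drive the ODE residual to zero at
# collocation points. ODE: y'(x) = y(x), y(0) = 1 on [0, 1];
# trial solution y(x) = 1 + sum_k c_k x^k (constant term enforces y(0) = 1).
deg = 10
xs = np.linspace(0.0, 1.0, 50)

# Residual R(x) = y'(x) - y(x) is linear in the coefficients c_k:
# c_k contributes (k x^{k-1} - x^k); the constant term contributes -1.
A = np.stack([k * xs**(k - 1) - xs**k for k in range(1, deg + 1)], axis=1)
b = np.ones_like(xs)                   # move the constant's residual to the RHS
c, *_ = np.linalg.lstsq(A, b, rcond=None)

y1 = 1.0 + sum(ck * 1.0**k for k, ck in enumerate(c, start=1))
print(abs(y1 - np.e))                  # y(1) should approximate e
```

The Deep Galerkin method replaces the polynomial with a neural network and the least-squares solve with stochastic gradient descent on the same residual loss, which is what lets it scale to high-dimensional PDEs.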


Towards Analyzing and Mitigating Sycophancy in Large Vision-Language Models

http://arxiv.org/abs/2408.11261v1

Compressor summary: The paper analyzes and proposes a method to reduce sycophancy, the undue influence of leading or deceptive prompts on large vision-language models, improving their performance and reducing biased outputs.


Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers

http://arxiv.org/abs/2408.11258v1

Compressor summary: The paper proposes a sampling-based method and a sequence-to-sequence model for simulating speech recognition errors in neural network acoustic models, improving their robustness and performance on unseen data.


Automatic Image Annotation (AIA) of AlmondNet-20 Method for Almond Detection by Improved CNN-based Model

http://arxiv.org/abs/2408.11253v1

Compressor summary: The paper presents a method using Deep Convolutional Neural Networks to accurately grade almonds and their shells, improving global agricultural product classification and trade.


Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

http://arxiv.org/abs/2408.11252v1

Compressor summary: The paper proposes a method to evaluate the faithfulness of explanation methods for autoregressive language models using counterfactual generation.


Irregularity Inspection using Neural Radiance Field

http://arxiv.org/abs/2408.11251v1

Compressor summary: The paper proposes a system that uses neural network modeling to create 3D twin models and compare them for defect detection in large machinery, reducing the need for manual inspections by personnel.


CNN-based Labelled Crack Detection for Image Annotation

http://arxiv.org/abs/2408.11250v1

Compressor summary: The paper proposes a vision-based approach using deep CNNs for crack detection in AM surfaces with high accuracy and efficiency.