arxiv compressed, 2024-06-28

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-06-28 generated by the compressor, my personal LLM-based project.


Dataset Size Recovery from LoRA Weights

http://arxiv.org/abs/2406.19395v1

Compressor summary: The paper introduces dataset size recovery, the task of estimating how many samples a model was fine-tuned on, directly from its LoRA weight matrices.


HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

http://arxiv.org/abs/2406.19394v1

Compressor summary: The paper introduces a self-training framework called HUWSOD for weakly supervised object detection, which uses innovative proposal generators and does not require external modules or additional supervision.


Looking 3D: Anomaly Detection with 2D-3D Alignment

http://arxiv.org/abs/2406.19393v1

Compressor summary: The paper introduces a new anomaly detection problem using 3D shapes and a large dataset of images with diverse anomalies, and proposes a transformer-based approach to solve it.


ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

http://arxiv.org/abs/2406.19392v1

Compressor summary: ReXTime is a benchmark to test AI models' ability to reason about cause-and-effect relationships across video segments, and it shows that current models are not yet as good as humans at this task.


Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

http://arxiv.org/abs/2406.19391v1

Compressor summary: Fibottention is a sparse, efficient, and general self-attention architecture based on Fibonacci sequences for visual tasks that captures fine-grained details while reducing computational overhead.
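
The summary gives no implementation details; as a rough, hypothetical illustration of Fibonacci-based sparse attention (not the paper's actual design), a head's mask could permit attention only at Fibonacci-numbered offsets:

```python
def fibonacci_offsets(max_offset):
    """Fibonacci numbers up to max_offset: 1, 2, 3, 5, 8, ..."""
    offs, a, b = [], 1, 2
    while a <= max_offset:
        offs.append(a)
        a, b = b, a + b
    return offs

def sparse_attention_mask(seq_len):
    """Boolean mask: token i may attend to token j iff |i - j| is 0 or Fibonacci."""
    allowed = {0} | set(fibonacci_offsets(seq_len))
    return [[abs(i - j) in allowed for j in range(seq_len)]
            for i in range(seq_len)]

mask = sparse_attention_mask(8)
# Each row keeps only a handful of positions, so the mask is sparse.
```

Varying the offset set per head (as the title's "diverse attention across heads" suggests) would give each head a different sparse pattern.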


SALVe: Semantic Alignment Verification for Floorplan Reconstruction from Sparse Panoramas

http://arxiv.org/abs/2406.19390v1

Compressor summary: Key points:

- New system for automatic 2D floorplan reconstruction using SALVe, a novel learned alignment verifier
- Inputs: sparse 360° panoramas with semantic features of windows, doors, and openings
- Outputs: room poses, layouts, and the floorplan
- Outperforms state-of-the-art SfM systems in completeness and accuracy

Summary: The authors present a new system that uses SALVe, a learned alignment verifier, to reconstruct 2D floorplans from sparse 360° panoramas with semantic features, achieving better results than existing methods.


OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

http://arxiv.org/abs/2406.19389v1

Compressor summary: OMG-LLaVA is a framework that combines pixel-level image understanding with reasoning abilities, enabling flexible user interaction via visual and text prompts.


The Remarkable Robustness of LLMs: Stages of Inference?

http://arxiv.org/abs/2406.19384v1

Compressor summary: The text shows that Large Language Models are very robust and can still predict well even after layers are deleted or swapped, and suggests that there are four stages of inference across different models.
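
The deletion/swap probes can be mimicked in a framework-free toy where the "model" is a list of near-redundant residual-style layer functions (purely illustrative; the paper works with real transformer layers):

```python
from functools import reduce

def run(layers, x):
    """Apply the layer functions in sequence."""
    return reduce(lambda h, layer: layer(h), layers, x)

# Toy "residual" layers: each adds a small correction, so any single one
# is nearly redundant, mimicking the robustness observed in LLMs.
layers = [lambda h, i=i: h + 0.01 * i for i in range(10)]

full    = run(layers, 1.0)
ablated = run(layers[:4] + layers[5:], 1.0)                            # delete layer 4
swapped = run(layers[:3] + [layers[4], layers[3]] + layers[5:], 1.0)   # swap layers 3 and 4
```

In this additive toy, swapping layers changes nothing and deleting one shifts the output only slightly, which is the qualitative behavior the summary reports for LLMs.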


TabReD: A Benchmark of Tabular Machine Learning in-the-Wild

http://arxiv.org/abs/2406.19380v1

Compressor summary: TabReD is a new collection of tabular machine learning benchmarks that reflect real-world scenarios by including time-based splits and feature engineering.


Suri: Multi-constraint Instruction Following for Long-form Text Generation

http://arxiv.org/abs/2406.19371v1

Compressor summary: This paper introduces Suri, a dataset for multi-constraint instruction following in long-form text generation, and proposes Instructional ORPO (I-ORPO), an alignment method that uses synthetic negative feedback from LLM-generated corrupted instructions to improve quality.


Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

http://arxiv.org/abs/2406.19370v1

Compressor summary: The authors propose analyzing how generative models learn abstract concepts using a concept space framework, finding that hidden capabilities emerge suddenly in the learning process.


Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

http://arxiv.org/abs/2406.19369v1

Compressor summary: RWKV-SAM is a fast and accurate segmentation model that combines convolution and RWKV (Receptance Weighted Key Value) operations in its backbone with an efficient decoder for multiscale tokens.


SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues

http://arxiv.org/abs/2406.19364v1

Compressor summary: SimTxtSeg is a novel framework that uses simple text cues to generate pseudo-labels and fuse text and image features for weakly-supervised medical image segmentation.


STAL3D: Unsupervised Domain Adaptation for 3D Object Detection via Collaborating Self-Training and Adversarial Learning

http://arxiv.org/abs/2406.19362v1

Compressor summary: The paper proposes a new framework, STAL3D, that combines self-training and adversarial learning for unsupervised domain adaptation in 3D object detection, improving performance on cross-domain tasks and addressing issues like background interference and source domain size bias.


The Model Arena for Cross-lingual Sentiment Analysis: A Comparative Study in the Era of Large Language Models

http://arxiv.org/abs/2406.19358v1

Compressor summary: This text compares the cross-lingual sentiment analysis performance of SMLMs and LLMs, finding that SMLMs excel in zero-shot settings while LLMs adapt better in few-shot settings.


DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions

http://arxiv.org/abs/2406.19356v1

Compressor summary: DiVERT is a novel method that uses a large language model to represent student errors as text, generating and interpreting plausible distractors for math multiple-choice questions.


Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

http://arxiv.org/abs/2406.19354v1

Compressor summary: The paper discusses challenges and proposes a testbed for model editing, which involves updating knowledge in language models, using a semi-synthetic dataset based on Wikidata.


CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement

http://arxiv.org/abs/2406.19353v1

Compressor summary: The paper introduces CORE4D, a large-scale 4D human-object-human interaction dataset that helps study collaborative object rearrangement and provides new challenges for generating human-object interactions.


IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

http://arxiv.org/abs/2406.19349v1

Compressor summary: The paper introduces IndoToxic2024, a dataset for Indonesian hate speech and toxicity classification, focusing on vulnerable groups during the presidential election, and evaluates its effectiveness with various models.


Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation

http://arxiv.org/abs/2406.19341v1

Compressor summary: The paper proposes a bi-level learning method for a visual conditioning token that can adapt a deep neural network model to different domains in image classification, improving its performance by up to 1.9%.


Efficient World Models with Context-Aware Tokenization

http://arxiv.org/abs/2406.19320v1

Compressor summary: $\Delta$-IRIS is a fast and efficient model-based RL agent that uses discrete autoencoders and autoregressive transformers to predict future deltas, achieving state-of-the-art results on the Crafter benchmark.


Jump Starting Bandits with LLM-Generated Prior Knowledge

http://arxiv.org/abs/2406.19317v1

Compressor summary: The paper shows how Large Language Models can improve Contextual Multi-Armed Bandits by simulating human preferences, reducing online learning regret and data-gathering costs.


Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation

http://arxiv.org/abs/2406.19316v1

Compressor summary: This paper proposes two methods to improve training data for Scene Graph Generation, Feature Space Triplet Augmentation and Soft Transfer, and shows their effectiveness in achieving high Recall scores.


LiveBench: A Challenging, Contamination-Free LLM Benchmark

http://arxiv.org/abs/2406.19314v1

Compressor summary: LiveBench is a new benchmark for large language models that updates questions from recent sources, scores answers automatically, and covers various challenging tasks to avoid test set contamination and biases.


The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning

http://arxiv.org/abs/2406.19307v1

Compressor summary: This paper reviews the current state of research on commonsense causality, which is essential for human intelligence and decision-making, but lacks systematic exploration.


Mapping Land Naturalness from Sentinel-2 using Deep Contextual and Geographical Priors

http://arxiv.org/abs/2406.19302v1

Compressor summary: Key points:

- Climate change is accelerating due to human actions
- Satellite images help observe and measure effects on natural areas
- A deep learning framework maps land naturalness from Sentinel-2 data using contextual and geographical priors
- Quantifying naturalness aids environmental stewardship

Summary: The text describes how satellite images and a deep learning framework can map land naturalness, which is affected by human actions and climate change, to help protect the environment.


MCNC: Manifold Constrained Network Compression

http://arxiv.org/abs/2406.19301v1

Compressor summary: MCNC is a new method to compress large AI models by constraining their parameter space to low-dimensional nonlinear manifolds, achieving high compression rates and performance across various tasks.


scTree: Discovering Cellular Hierarchies in the Presence of Batch Effects in scRNA-seq Data

http://arxiv.org/abs/2406.19300v1

Compressor summary: scTree is a new method for single-cell RNA sequencing data that corrects batch effects and learns a tree structure representing clusters and their hierarchies, improving understanding of cellular landscapes.


PNeRV: A Polynomial Neural Representation for Videos

http://arxiv.org/abs/2406.19299v1

Compressor summary: PNeRV is a patch-wise implicit neural representation for videos that preserves spatiotemporal continuity using polynomial neural networks and hierarchical sampling, achieving better performance in tasks like compression and downstream applications.


Compositional Image Decomposition with Diffusion Models

http://arxiv.org/abs/2406.19298v1

Compressor summary: The paper introduces Decomp Diffusion, an unsupervised method that decomposes images into compositional components, allowing for flexible scene composition.


Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

http://arxiv.org/abs/2406.19297v1

Compressor summary: This paper explores how different modalities (e.g., vision and language) evolve at different rates when training models on a sequence of tasks, proposes a modality-aware feature distillation method to improve performance, and shows its effectiveness in multimodal continual learning settings.


From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

http://arxiv.org/abs/2406.19292v1

Compressor summary: Finetuning large language models on a synthetic dataset of numerical key-value retrieval tasks enhances their information retrieval and reasoning abilities in long-context settings without causing hallucination or sacrificing general benchmark performance.
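
The summary's "numerical key-value retrieval tasks" suggest data that is easy to generate programmatically. A minimal sketch of one such synthetic example; the formatting choices here are hypothetical, not taken from the paper:

```python
import random

def make_kv_retrieval_example(num_pairs=8, seed=0):
    """Build one synthetic retrieval prompt: a dictionary of numeric keys and
    values, plus a question asking for the value of one randomly chosen key."""
    rng = random.Random(seed)
    keys = rng.sample(range(1000, 9999), num_pairs)
    kv = {k: rng.randrange(100, 999) for k in keys}
    query = rng.choice(keys)
    context = ", ".join(f"{k}: {v}" for k, v in kv.items())
    prompt = f"Dictionary: {{{context}}}. What is the value for key {query}?"
    return prompt, str(kv[query])

prompt, answer = make_kv_retrieval_example()
```

Because the ground-truth answer is known by construction, arbitrarily long and difficult examples can be generated without any labeling cost, which is what makes this style of finetuning data attractive.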


Human Modelling and Pose Estimation Overview

http://arxiv.org/abs/2406.19290v1

Compressor summary: This paper investigates various methods and applications in 2D and 3D human pose estimation, comparing state-of-the-art algorithms and discussing future directions.


HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

http://arxiv.org/abs/2406.19280v1

Compressor summary: The paper introduces PubMedVision, a large and high-quality dataset for medical image-text pairs created by refining existing data and using GPT-4V to denoise and reformat it, leading to improved medical multimodal capabilities of language models.


VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation

http://arxiv.org/abs/2406.19276v1

Compressor summary: VERISCORE is a metric for evaluating factuality in diverse long-form generation tasks that can handle both verifiable and unverifiable claims, and it reveals that different models perform differently across tasks.


Stochastic Concept Bottleneck Models

http://arxiv.org/abs/2406.19272v1

Compressor summary: SCBMs use distributional parameterization to model concept dependencies and improve intervention effectiveness in interpretable Concept Bottleneck Models.


AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

http://arxiv.org/abs/2406.19271v1

Compressor summary: The research proposes a system that collects and filters web data using trusted AI models to ensure pure and reliable training data for Large Language Models.


Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

http://arxiv.org/abs/2406.19263v1

Compressor summary: The paper introduces a Tree-of-Lens (ToL) agent that uses a Hierarchical Layout Tree to understand and describe screen content and layout based on user-indicated points, outperforming other tools on a new Screen Point-and-Read benchmark.


Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers

http://arxiv.org/abs/2406.19258v1

Compressor summary: GCFormer is a new graph Transformer that creates positive and negative token sequences to better capture diverse graph information and improve node representation quality for node classification tasks.


AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI

http://arxiv.org/abs/2406.19256v1

Compressor summary: AIDRIN is a framework that evaluates the quality and suitability of data for AI using various metrics, visualizations, and reports.


Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

http://arxiv.org/abs/2406.19255v1

Compressor summary: This paper proposes a method called Finsta to improve video-language models by aligning text and video using fine-grained scene graphs, achieving better results on various tasks.


Advection Augmented Convolutional Neural Networks

http://arxiv.org/abs/2406.19253v1

Compressor summary: The authors propose a new architecture for predicting space-time sequences that combines CNNs with advection and reaction-diffusion components, inspired by physical processes, to improve performance and explainability.


AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation

http://arxiv.org/abs/2406.19251v1

Compressor summary: The AutoRAG-HP framework uses a hierarchical multi-armed bandit method to efficiently tune hyper-parameters for Retrieval-Augmented Generation systems, achieving high recall with fewer API calls than Grid Search.
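
AutoRAG-HP's hierarchical scheme isn't detailed in the summary; as a simplified, flat illustration of bandit-based hyper-parameter tuning, a plain UCB1 loop over candidate settings might look like this (the `evaluate` reward function is a hypothetical stand-in for a RAG quality score):

```python
import math
import random

def ucb_tune(arms, evaluate, rounds=200, c=1.4):
    """Flat UCB1 over candidate hyper-parameter settings.
    evaluate(arm) returns a noisy reward in [0, 1]."""
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for t in range(1, rounds + 1):
        if t <= len(arms):  # play each arm once first
            i = t - 1
        else:               # then balance exploitation and exploration
            i = max(range(len(arms)),
                    key=lambda a: totals[a] / counts[a]
                    + c * math.sqrt(math.log(t) / counts[a]))
        counts[i] += 1
        totals[i] += evaluate(arms[i])
    return arms[max(range(len(arms)), key=lambda a: totals[a] / counts[a])]

# Toy example: pick the retrieval top-k whose (noisy) reward is highest.
rng = random.Random(0)
true_quality = {2: 0.55, 5: 0.7, 10: 0.6}
best_k = ucb_tune([2, 5, 10], lambda k: true_quality[k] + rng.uniform(-0.05, 0.05))
```

The advantage over Grid Search is that each pull is one evaluation call, and poor settings stop being evaluated once their estimated reward falls behind.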


NTFormer: A Composite Node Tokenized Graph Transformer for Node Classification

http://arxiv.org/abs/2406.19249v1

Compressor summary: NTFormer is a new graph Transformer that uses a novel token generator called Node2Par to express rich graph features, enabling it to learn node representations without requiring graph-specific modifications.


Local Manifold Learning for No-Reference Image Quality Assessment

http://arxiv.org/abs/2406.19247v1

Compressor summary: The paper proposes a new method for assessing image quality by combining local manifold learning with contrastive learning, which improves differentiation and performance compared to existing methods.


Improving the Expressiveness of $K$-hop Message-Passing GNNs by Injecting Contextualized Substructure Information

http://arxiv.org/abs/2406.19244v1

Compressor summary: Key points:

- GNNs are powerful for graph learning but have limited expressiveness
- $K$-hop GNNs aggregate information from neighbors within $K$ hops
- A substructure encoding function enhances the expressive power of $K$-hop GNNs
- The method is provably more powerful than previous works and achieves state-of-the-art results

Summary: The paper proposes a substructure encoding function to improve the expressive power of $K$-hop graph neural networks, which outperforms previous works and matches the 3-WL test.


Revealing Fine-Grained Values and Opinions in Large Language Models

http://arxiv.org/abs/2406.19238v1

Compressor summary: The study analysed how large language models respond to political statements and found that demographic features and question formats affect their stances, revealing biases and patterns in their justifications.


FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

http://arxiv.org/abs/2406.19237v1

Compressor summary: FlowVQA is a new benchmark for testing visual question-answering models on flowcharts with various reasoning tasks.


Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

http://arxiv.org/abs/2406.19236v1

Compressor summary: HA-VLN is a new approach to navigation that considers dynamic human activities and uses novel datasets and agents to improve real-world applicability and robustness.


RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

http://arxiv.org/abs/2406.19232v1

Compressor summary: RuBLiMP is a new benchmark for testing Russian language models' grammatical knowledge by providing diverse minimal pairs covering various linguistic phenomena.


Tools Fail: Detecting Silent Errors in Faulty Tools

http://arxiv.org/abs/2406.19228v1

Compressor summary: The text presents a framework that lets LLMs detect silent errors in faulty tools and plan around them, shifting the focus from merely choosing tools to handling their failures.


Aligning Teacher with Student Preferences for Tailored Training Data Generation

http://arxiv.org/abs/2406.19227v1

Compressor summary: ARTE is a framework that aligns teacher instructional content with student preferences to generate tailored training examples for Knowledge Distillation on edge devices.


Simulating Classroom Education with LLM-Empowered Agents

http://arxiv.org/abs/2406.19226v1

Compressor summary: Key points:

- The paper proposes SimClass, a multi-agent framework using LLMs for virtual classroom teaching
- SimClass simulates classroom interaction patterns and enhances user experience
- Agents collaborate to create enlivening interactions and improve the learning process

Summary: SimClass is a novel framework that uses large language models to simulate and enhance classroom interactions in a virtual setting, improving the user's learning experience.


ProtoGMM: Multi-prototype Gaussian-Mixture-based Domain Adaptation Model for Semantic Segmentation

http://arxiv.org/abs/2406.19225v1

Compressor summary: The ProtoGMM model uses Gaussian mixtures to estimate multi-prototype distributions for semantic segmentation in unlabeled target domains by leveraging supervised models from labeled source domains, improving predictions with contrastive learning.


T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

http://arxiv.org/abs/2406.19223v1

Compressor summary: T-FREE is a novel tokenizer that improves efficiency and cross-lingual transfer learning by embedding words using sparse activation patterns over character triplets without a reference corpus.


Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

http://arxiv.org/abs/2406.19217v1

Compressor summary: The paper proposes a novel error detection method for robot-assisted surgeries using contextual information from videos and reasoning modules inspired by natural language processing.


SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation

http://arxiv.org/abs/2406.19215v1

Compressor summary: SeaKR is a new model that uses LLMs' self-aware uncertainty to retrieve and integrate knowledge for better question answering.


Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights

http://arxiv.org/abs/2406.19195v1

Compressor summary: The paper proposes a method to estimate the long-term heterogeneous dose-response curve by using optimal transport weighting to account for unobserved confounders and providing theoretical guarantees for counterfactual prediction error.


BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring

http://arxiv.org/abs/2406.19189v1

Compressor summary: The study builds on BENDR, a BERT-inspired model, using pre-training and fine-tuning on large EEG datasets to improve seizure detection with few false positives and acceptable sensitivity.


Averaging log-likelihoods in direct alignment

http://arxiv.org/abs/2406.19188v1

Compressor summary: The paper introduces a method to make direct alignment in large language models more consistent across different completion lengths using a new averaging operator.


Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion

http://arxiv.org/abs/2406.19185v1

Compressor summary: CoPG is an RL algorithm that can finetune LLMs with off-policy data and outperforms direct alignment methods and policy gradient in generalization.


Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data

http://arxiv.org/abs/2406.19175v1

Compressor summary: The paper discusses using synthetic and real-world X-ray images to train an object detection model, finding that a mix of synthetic and unlabeled real-world data is more cost-efficient than using only labeled real-world data.


Annotation Errors and NER: A Study with OntoNotes 5.0

http://arxiv.org/abs/2406.19172v1

Compressor summary: The paper presents three techniques to detect and fix annotation errors in the OntoNotes 5.0 English NER corpus, improving model performance.


The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems

http://arxiv.org/abs/2406.19170v1

Compressor summary: The study investigates whether explanations help users understand an AI system's limitations when it performs poorly on a visual task involving full-color or grayscale images, and finds that explanations do not improve users' perceptions of the system's capabilities and limitations.


Single Image Estimation of Cell Migration Direction by Deep Circular Regression

http://arxiv.org/abs/2406.19162v1

Compressor summary: The paper proposes a new method for estimating the migration direction of cells from a single image using deep circular regression and achieving better accuracy than previous methods.


Heterogeneous Causal Metapath Graph Neural Network for Gene-Microbe-Disease Association Prediction

http://arxiv.org/abs/2406.19156v1

Compressor summary: The paper proposes a neural network (HCMGNN) to predict gene-microbe-disease associations using causal metapaths and semantic sharing.


Advancing operational PM2.5 forecasting with dual deep neural networks (D-DNet)

http://arxiv.org/abs/2406.19154v1

Compressor summary: The dual deep neural network (D-DNet) is a more efficient and accurate model for predicting PM2.5 and AOD550 levels than traditional models or the CAMS 4D-Var system, using real-time observations to improve forecasting.


RAVEN: Multitask Retrieval Augmented Vision-Language Learning

http://arxiv.org/abs/2406.19150v1

Compressor summary: RAVEN is a framework that enhances vision-language models with retrieval augmentation for multiple tasks, improving efficiency and performance.


BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision

http://arxiv.org/abs/2406.19148v1

Compressor summary: The authors propose a random background augmentation method called BackMix for neural networks to improve echocardiogram view classification by focusing on the image content and reducing spurious correlations.


Resolving Discrepancies in Compute-Optimal Scaling of Language Models

http://arxiv.org/abs/2406.19146v1

Compressor summary: The paper compares two scaling laws for optimal model size as a function of compute budget, identifies factors causing discrepancies, and shows how to obtain agreement with one law by correcting these factors.
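
Both scaling laws express optimal model size as a power law in compute. As generic background (not the paper's analysis), such a law can be recovered by least squares in log-log space; the data below are synthetic, using a Chinchilla-style exponent of 0.5:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    a = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    k = math.exp(my - a * mx)
    return k, a

# Synthetic compute-optimal data following N_opt = 0.1 * C**0.5 exactly.
compute = [1e18, 1e19, 1e20, 1e21]
n_opt = [0.1 * c ** 0.5 for c in compute]
k, a = fit_power_law(compute, n_opt)
```

Discrepancies between fitted exponents like `a` across studies are exactly the kind of disagreement the paper traces back to experimental-setup factors.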


YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention

http://arxiv.org/abs/2406.19136v1

Compressor summary: Key points:

- Solubility prediction is crucial for drug effectiveness and safety
- Traditional methods fail to capture complex molecular structures
- A novel deep learning framework combines attention-based transformers, LSTM networks, and GCNs
- It outperforms benchmark models and offers insights for drug design and selection

Summary: The text presents a novel deep learning framework that improves the prediction of drug solubility by capturing complex molecular structures better than traditional methods, with potential applications for drug discovery.


CELLO: Causal Evaluation of Large Vision-Language Models

http://arxiv.org/abs/2406.19131v1

Compressor summary: The authors introduce CELLO, a novel dataset for causal reasoning tasks involving interactions between humans and objects, and propose CELLO-CoT, a chain-of-thought prompting strategy to improve large vision-language models' performance on these tasks.


Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis

http://arxiv.org/abs/2406.19130v1

Compressor summary: The paper proposes an evidential Concept Embedding Model that improves interpretable deep learning methods in medical image analysis by modeling concept uncertainty and rectifying misalignments for better clinical diagnosis explanations.


Towards Learning Abductive Reasoning using VSA Distributed Representations

http://arxiv.org/abs/2406.19121v1

Compressor summary: ARLC is a model that learns abductive reasoning with context, achieving high accuracy and interpretability on Raven's matrices tasks, surpassing neuro-symbolic and large language models with fewer parameters.


CHEW: A Dataset of CHanging Events in Wikipedia

http://arxiv.org/abs/2406.19116v1

Compressor summary: CHEW is a dataset of Wikipedia changes in text, used to test LLMs' timeline understanding and identify meaning shifts.


A Teacher Is Worth A Million Instructions

http://arxiv.org/abs/2406.19112v1

Compressor summary: The paper proposes an improved training method for smaller LLMs by leveraging knowledge from larger models and a novel post-training domain alignment phase, achieving better results than state-of-the-art models with more parameters.


FDLite: A Single Stage Lightweight Face Detector Network

http://arxiv.org/abs/2406.19107v1

Compressor summary: The text describes a new lightweight face detector with a customized backbone network and two multi-task losses that achieves high accuracy on the WIDER FACE dataset.


Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

http://arxiv.org/abs/2406.19102v1

Compressor summary: Statements are a new data structure for extracting quantitative facts from ESG reports using deep learning models like SemTabNet, which can facilitate exploratory data analysis.


DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

http://arxiv.org/abs/2406.19101v1

Compressor summary: DocKylin is a document-centric MLLM that uses pixel-level slimming and token-level slimming to improve visual content understanding in high-resolution document images.


Fairness and Bias in Multimodal AI: A Survey

http://arxiv.org/abs/2406.19097v1

Compressor summary: This survey examines 50 datasets and models concerning fairness and bias in Large Multimodal Models (LMMs) and proposes preuse, a new category for quantifying bias.


Adaptive Stochastic Weight Averaging

http://arxiv.org/abs/2406.19092v1

Compressor summary: ASWA is a technique that improves generalization by updating a running average of model parameters only when it helps the validation performance, combining SWA and early stopping.
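
The summary describes a concrete rule: keep a running parameter average, but accept each update only if validation improves. A framework-free sketch of that idea follows, with a made-up 1-D toy at the end; this illustrates the stated rule, not the paper's exact algorithm:

```python
def adaptive_swa(param_stream, validate):
    """Maintain a running average of parameter vectors, but commit an update
    only if it improves the validation score (higher is better)."""
    avg, n = None, 0
    best_score = float("-inf")
    for params in param_stream:
        if avg is None:
            candidate, m = list(params), 1
        else:
            m = n + 1
            candidate = [(a * n + p) / m for a, p in zip(avg, params)]
        score = validate(candidate)
        if score > best_score:        # accept the averaged weights
            avg, n, best_score = candidate, m, score
        # otherwise reject and keep the previous average (early-stopping flavor)
    return avg

# Toy check: 1-D "parameters", validation prefers values close to 1.0.
stream = [[0.8], [1.2], [5.0], [1.0]]
result = adaptive_swa(stream, lambda w: -abs(w[0] - 1.0))
```

In the toy, the outlier checkpoint `[5.0]` is rejected because averaging it in would hurt validation, so the final average stays at `[1.0]`.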


Dimensions underlying the representational alignment of deep neural networks with humans

http://arxiv.org/abs/2406.19087v1

Compressor summary: The text compares how humans and artificial neural networks represent images, finding similarities and differences in their strategies and highlighting the need for better alignment.


AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries

http://arxiv.org/abs/2406.19073v1

Compressor summary: AMBROSIA is a new benchmark for testing text-to-SQL parsers' ability to handle ambiguous questions with different types of uncertainty, using controlled databases generated from scratch.


EmPO: Theory-Driven Dataset Construction for Empathetic Response Generation through Preference Optimization

http://arxiv.org/abs/2406.19071v1

Compressor summary: The paper proposes a new method to improve empathetic responses from conversational agents using preference datasets and optimization algorithms, evaluating it on two metrics and a benchmark dataset.


FAGhead: Fully Animate Gaussian Head from Monocular Videos

http://arxiv.org/abs/2406.19070v1

Compressor summary: FAGhead is a method that creates realistic 3D human avatars from monocular videos with controllable expressions and poses, using Point-based Learnable Representation Field (PLRF) and alpha rendering.


Dancing in the Shadows: Harnessing Ambiguity for Fairer Classifiers

http://arxiv.org/abs/2406.19066v1

Compressor summary: The paper proposes using uncertain identity data to train classifiers for better algorithmic fairness.


STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis

http://arxiv.org/abs/2406.19065v1

Compressor summary: The paper introduces STBench, a benchmark dataset for evaluating spatio-temporal understanding in large language models, and assesses 13 LLMs on four dimensions of this capability.


Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO

http://arxiv.org/abs/2406.19057v1

Compressor summary: REC-based detection uses language descriptions to detect objects outside existing class names but can produce false positives; filtering detections by size mitigates these, improving semantic segmentation and data-annotation efficiency.


SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images

http://arxiv.org/abs/2406.19055v1

Compressor summary: SimpleFusion is a simple framework for fusing visible and infrared images using two plain convolutional neural networks without downsampling, preserving complementary information between the modalities.


A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)

http://arxiv.org/abs/2406.19054v1

Compressor summary: The text describes an accessible prototype system for non-experts to interact with machine learning and gain insights into user behavior, using a novel methodology for interactive machine learning and multimodal interaction.


Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation

http://arxiv.org/abs/2406.19049v1

Compressor summary: The study explores when the accuracy-on-the-line relationship in machine learning breaks down due to noisy data, nuisance features, and spurious features, leading to "Accuracy-on-the-wrong-line".


BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

http://arxiv.org/abs/2406.19048v1

Compressor summary: The paper proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion, that achieves robust semantic- and spatial-aware 3D object detection by mutually enhancing multi-modal features and adaptively selecting them.


On Convex Optimization with Semi-Sensitive Features

http://arxiv.org/abs/2406.19040v1

Compressor summary: The paper analyzes differential privacy in empirical risk minimization for semi-sensitive data and provides better bounds on excess risk.


Improving Weak-to-Strong Generalization with Reliability-Aware Alignment

http://arxiv.org/abs/2406.19032v1

Compressor summary: The paper proposes a method to improve large language models' generalization by using the reliability of weak supervision signals in the alignment process, which helps reduce errors from noisy supervision and enhance model accuracy.


Using diffusion model as constraint: Empower Image Restoration Network Training with Diffusion Model

http://arxiv.org/abs/2406.19030v1

Compressor summary: The paper introduces DiffLoss, a naturalness-oriented and semantic-aware optimization mechanism for image restoration that leverages diffusion models' distribution coverage and high-level semantic space to improve visual and semantic perception quality.


Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes

http://arxiv.org/abs/2406.19015v1

Compressor summary: The authors use Gaussian process models to analyze battery resistance, develop fault detection rules, and understand cell-level failure mechanisms for lithium iron phosphate batteries based on a large dataset of returned batteries.
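As a rough illustration of the kind of Gaussian-process regression applied to resistance trajectories, here is a dependency-free sketch on made-up data; the RBF kernel, length-scale, noise level, and "resistance vs. time" values are all assumptions for illustration, not the authors' setup:

```python
import math

def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two scalar inputs.
    return math.exp(-(a - b) ** 2 / (2 * length ** 2))

def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(xs, ys, x_star, noise=1e-6):
    """Posterior mean of a zero-mean GP with an RBF kernel."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(K, ys)
    return sum(rbf(x_star, xi) * ai for xi, ai in zip(xs, alpha))

# Toy "internal resistance over time" observations (illustrative only):
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.00, 1.02, 1.05, 1.10]
print(gp_predict(xs, ys, 1.5))  # smooth interpolation between observations
```

Deviations between a cell's observed resistance and the GP's smooth trend are the kind of signal a fault-detection rule could threshold on.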


VideoMambaPro: A Leap Forward for Mamba in Video Understanding

http://arxiv.org/abs/2406.19006v1

Compressor summary: VideoMambaPro improves video action recognition by addressing limitations in Mamba's token processing with masked backward computation and elemental residual connections, outperforming transformer models while being more efficient.


Improving Taxonomic Image-based Out-of-distribution Detection With DNA Barcodes

http://arxiv.org/abs/2406.18999v1

Compressor summary: The paper proposes a re-ordering approach to improve image-based species identification by using DNA barcodes to detect out-of-distribution classes.


Zero-shot domain adaptation based on dual-level mix and contrast

http://arxiv.org/abs/2406.18996v1

Compressor summary: The paper proposes a zero-shot domain adaptation method that uses data augmentation, domain adversarial learning, and dual-level contrastive learning to learn domain-invariant features with low task bias for the target task of interest.


Semi-supervised Concept Bottleneck Models

http://arxiv.org/abs/2406.18992v1

Compressor summary: SSCBM is a new framework that improves concept bottleneck models by using semi-supervised learning and aligning unlabeled data with concepts, making them more accurate and efficient even with limited labeled data.


A Fast Learning-Based Surrogate of Electrical Machines using a Reduced Basis

http://arxiv.org/abs/2406.18990v1

Compressor summary: The article proposes a real-time surrogate model for parameterized PDEs using a hybrid method of POD and SVR, aimed for interactive analysis in digital twins applications.


Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis

http://arxiv.org/abs/2406.18967v1

Compressor summary: UNest is a novel unpaired medical image synthesis architecture that uses structural inductive biases and structural attention to improve the synthesis of anatomical regions, achieving significant improvements over recent methods on three modalities.


UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models

http://arxiv.org/abs/2406.18966v1

Compressor summary: UniGen is a framework that uses large language models to generate diverse, accurate, and controllable text datasets for various applications, such as benchmarking and data augmentation.


AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

http://arxiv.org/abs/2406.18958v1

Compressor summary: AnyControl is a framework for image synthesis that handles diverse control signals and produces high-quality images faithful to the input text.


Alignment For Performance Improvement in Conversation Bots

http://arxiv.org/abs/2406.18954v1

Compressor summary: Alignment methods improve bot performance in following rules compared to instruction fine-tuning alone.


Investigating and Defending Shortcut Learning in Personalized Diffusion Models

http://arxiv.org/abs/2406.18944v1

Compressor summary: This paper analyzes the vulnerability of personalized diffusion models to adversarial perturbations and proposes a method to align the latent image with its semantic meaning and contrastive learning to prevent performance degradation.


CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation

http://arxiv.org/abs/2406.18941v1

Compressor summary: CLIP3D-AD is a novel 3D few-shot anomaly detection method that adapts CLIP for classification and segmentation using synthesized anomalous images, image and text adapters, and multi-view fusion.


Evaluating AI Group Fairness: a Fuzzy Logic Perspective

http://arxiv.org/abs/2406.18939v1

Compressor summary: The text proposes a fuzzy logic framework to evaluate and standardize different definitions of group fairness in AI systems, considering uncertain and context-specific beliefs.


Semi-adaptive Synergetic Two-way Pseudoinverse Learning System

http://arxiv.org/abs/2406.18931v1

Compressor summary: The paper proposes a new learning system that simplifies hyperparameter tuning and improves training efficiency using semi-adaptive synergetic two-way pseudoinverse learning subsystems trained without gradient descent.


Reasoning About Action and Change

http://arxiv.org/abs/2406.18930v1

Compressor summary: The book covers various aspects of AI research, from fundamentals to applications, and targets students and professionals interested in the field.


RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation

http://arxiv.org/abs/2406.18927v1

Compressor summary: The paper proposes a novel method for rectifying deviated fisheye images by learning a distortion vector map that captures local distortion features and improves performance with data augmentation.


Fine-tuned network relies on generic representation to solve unseen cognitive task

http://arxiv.org/abs/2406.18926v1

Compressor summary: The study compares two GPT-2 models on a novel decision-making task, showing that fine-tuned models rely more on pretrained representations than scratch-trained ones, which develop more specific mechanisms.


Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

http://arxiv.org/abs/2406.18925v1

Compressor summary: The paper introduces VisArgs, a dataset for evaluating AI's ability to understand visual arguments, and shows that current AI models struggle with identifying relevant visual cues and perform better when given them as input.


Learning Pareto Set for Multi-Objective Continuous Robot Control

http://arxiv.org/abs/2406.18924v1

Compressor summary: The paper proposes an efficient multi-objective reinforcement learning algorithm that learns a continuous representation of the Pareto set using a single hypernet, achieving better performance on robot control problems.


Time Matters: Scaling Laws for Any Budget

http://arxiv.org/abs/2406.18922v1

Compressor summary: The paper proposes a better way to estimate training time and loss for transformer models based on memory copies, allowing for more efficient architecture design.


Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data

http://arxiv.org/abs/2406.18921v1

Compressor summary: The paper proposes using psychological questions to enhance small role-playing language models, improving their dialogue generation and character portrayal.


TrustUQA: A Trustful Framework for Unified Structured Data Question Answering

http://arxiv.org/abs/2406.18916v1

Compressor summary: TrustUQA is a trustful question answering framework that supports multiple types of structured data using a Condition Graph representation and a two-level querying method with dynamic demonstration retrieval.


Factor-Conditioned Speaking-Style Captioning

http://arxiv.org/abs/2406.18910v1

Compressor summary: The paper introduces a method for generating diverse and accurate speaking-style captions by first predicting speaking-style factors and then sampling from them.


A Universal Railway Obstacle Detection System based on Semi-supervised Segmentation And Optical Flow

http://arxiv.org/abs/2406.18908v1

Compressor summary: The authors propose a semi-supervised segmentation method using synthetic images and optical flow clues for detecting obstacles in railway scenarios with varying conditions.


Historia Magistra Vitae: Dynamic Topic Modeling of Roman Literature using Neural Embeddings

http://arxiv.org/abs/2406.18907v1

Compressor summary: The authors compare traditional and BERT-based dynamic topic models on Roman literature and find that although quantitative metrics favor the traditional models, the BERT-based models yield better insights and are easier to use.


Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

http://arxiv.org/abs/2406.18906v1

Compressor summary: The authors develop a task to assess how well large language models can recognize different poetic forms and elements, and discuss the challenges of creating benchmarks for poetry and other creative tasks.


Autoencoder based approach for the mitigation of spurious correlations

http://arxiv.org/abs/2406.18901v1

Compressor summary: The paper proposes an autoencoder-based method to analyze and reduce spurious correlations, improving out-of-distribution generalization for deep neural networks on the Global Wheat Head Detection dataset.


360 in the Wild: Dataset for Depth Prediction and View Synthesis

http://arxiv.org/abs/2406.18898v1

Compressor summary: The paper introduces a large-scale 360° video dataset collected from diverse real-world environments, with pose and depth information, for learning-based tasks such as depth estimation and view synthesis.


Can we teach language models to gloss endangered languages?

http://arxiv.org/abs/2406.18895v1

Compressor summary: The paper explores using large language models for generating interlinear glossed text without any training and shows that targeted example selection can improve performance.


AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

http://arxiv.org/abs/2406.18893v1

Compressor summary: The paper proposes new methods to improve text-to-image customization by aligning generated images with user-supplied reference images and fixing issues in existing methods.


Sequential three-way group decision-making for double hierarchy hesitant fuzzy linguistic term set

http://arxiv.org/abs/2406.18884v1

Compressor summary: The paper proposes a novel multi-level sequential three-way decision method for group decision-making (S3W-GDM) that considers vagueness, hesitation, and variation in GDM problems, using granular computing to improve efficiency.


SSP: Self-Supervised Prompting for Cross-Lingual Transfer to Low-Resource Languages using Large Language Models

http://arxiv.org/abs/2406.18880v1

Compressor summary: The paper explores how large language models can be used for tasks in low-resource languages without labeled data, via a novel in-context learning approach called Self-Supervised Prompting (SSP).


Efficacy of Language Model Self-Play in Non-Zero-Sum Games

http://arxiv.org/abs/2406.18872v1

Compressor summary: Self-play improves language models in both cooperative and competitive settings, even when objectives change from cooperation to competition or vice versa.


Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models

http://arxiv.org/abs/2406.18868v1

Compressor summary: RAIL is a method to learn from multiple domains without forgetting or relying on extra data, while preserving the VLM's zero-shot ability using recursive ridge regression and feature projection.


From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions

http://arxiv.org/abs/2406.18865v1

Compressor summary: DCEM is an algorithm for learning from selective labels with disparate censorship, which mitigates labeling biases without compromising performance on synthetic and clinical data.


Learning Modality Knowledge Alignment for Cross-Modality Transfer

http://arxiv.org/abs/2406.18864v1

Compressor summary: The paper studies how modality gap affects cross-modality transfer and proposes MoNA, a meta-learning method that reduces the gap by transforming target data.


Predicting the duration of traffic incidents for Sydney greater metropolitan area using machine learning methods

http://arxiv.org/abs/2406.18861v1

Compressor summary: The researchers developed a machine learning model to predict traffic incident durations in Sydney, using data on road characteristics, incidents, and socio-economic factors, and found that XGBoost performed best.


Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification

http://arxiv.org/abs/2406.18859v1

Compressor summary: The study investigates if large language models can generate patient-friendly versions of radiology reports using self-correction prompts and a new evaluation method with radiologists and laypeople.


FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus

http://arxiv.org/abs/2406.18856v1

Compressor summary: The paper introduces a Chinese-English financial news dataset (FFN) and evaluates ChatGPT, ERNIE-bot, and OpenNMT models for financial machine translation, highlighting challenges and opportunities in this domain.


What Is Missing In Homophily? Disentangling Graph Homophily For Graph Neural Networks

http://arxiv.org/abs/2406.18854v1

Compressor summary: The paper proposes Tri-Hom, a new composite metric that considers three aspects of graph homophily and shows its effectiveness in understanding the performance of Graph Neural Networks.


Decoding-Time Language Model Alignment with Multiple Objectives

http://arxiv.org/abs/2406.18853v1

Compressor summary: The paper proposes a multi-objective decoding algorithm that combines multiple language models to better serve diverse user needs and shows its effectiveness in improving various metrics.
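The general idea of decoding-time multi-objective combination can be sketched as a weighted sum of per-model next-token log-probabilities; the toy "models", tokens, and weights below are invented for illustration and are not the paper's exact algorithm:

```python
import math

def combine_logprobs(model_logprobs, weights):
    """Weighted sum of per-model next-token log-probabilities,
    renormalized into a single distribution over the shared vocabulary."""
    vocab = model_logprobs[0].keys()
    scores = {t: sum(w * lp[t] for w, lp in zip(weights, model_logprobs))
              for t in vocab}
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: s - z for t, s in scores.items()}

# Two toy "models" tuned for different objectives (values are illustrative):
helpful = {"yes": math.log(0.7), "no": math.log(0.3)}
safe = {"yes": math.log(0.2), "no": math.log(0.8)}

# A user-chosen weight vector trades the objectives off at decoding time:
mixed = combine_logprobs([helpful, safe], weights=[0.5, 0.5])
best = max(mixed, key=mixed.get)
print(best, mixed)
```

Changing the weight vector steers generation toward one objective or the other without retraining either model.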


LICO: Large Language Models for In-Context Molecular Optimization

http://arxiv.org/abs/2406.18851v1

Compressor summary: LICO is a model that enhances large language models for black-box optimization in the molecular domain using in-context predictions.


Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

http://arxiv.org/abs/2406.18849v1

Compressor summary: The paper introduces Dysca, a dynamic and scalable benchmark for evaluating large vision-language models on novel images, questions, and answers across different styles, scenarios, and question types.


Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation

http://arxiv.org/abs/2406.18848v1

Compressor summary: The paper proposes a sparse self-attention model using domain knowledge to impute missing hourly step count data from wearable sensors in real-world settings.


Learning Retrieval Augmentation for Personalized Dialogue Generation

http://arxiv.org/abs/2406.18847v1

Compressor summary: The paper proposes LAPDOG, a method that uses external knowledge and story retrieval to generate personalized dialogues based on persona profiles.


Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition

http://arxiv.org/abs/2406.18845v1

Compressor summary: The paper proposes a dual-stream framework for event stream-based pattern recognition that uses differentiated fusion, Transformer, GNN, and a hybrid interaction readout mechanism to achieve state-of-the-art performance on multiple datasets.


Revisiting Backdoor Attacks against Large Vision-Language Models

http://arxiv.org/abs/2406.18844v1

Compressor summary: This paper investigates the generalizability of backdoor attacks on large vision-language models during instruction tuning and proposes modifications to improve attack effectiveness across different domains.


Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

http://arxiv.org/abs/2406.18839v1

Compressor summary: The study proposes using simpler questions before retrieving visual or non-visual information to improve knowledge-based visual question-answering performance on three datasets.


Dense Monocular Motion Segmentation Using Optical Flow and Pseudo Depth Map: A Zero-Shot Approach

http://arxiv.org/abs/2406.18837v1

Compressor summary: The hybrid approach combines deep learning and optical flow to perform motion segmentation without training data, using object proposals, clustering, and depth maps as cues.


Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs

http://arxiv.org/abs/2406.18836v1

Compressor summary: The paper presents a new image retrieval method using masked image-text pairs to learn the relationship between a query image and text, improving accuracy.


OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

http://arxiv.org/abs/2406.18832v1

Compressor summary: OutlierTune is an efficient method for quantizing large language models' activations by pre-executing dequantization and symmetrization, achieving better generalization and hardware efficiency.
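For context, here is the plain per-channel symmetric int8 scheme that channel-wise quantization methods like this build on; the code is a baseline sketch with made-up values, not OutlierTune's actual algorithm:

```python
def quantize_channel(values):
    """Symmetric per-channel int8 quantization: one scale per channel."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_channel(q, scale):
    # Reverse the mapping; error comes only from rounding.
    return [x * scale for x in q]

# One activation channel containing an outlier (values are illustrative):
channel = [0.1, -0.2, 0.05, 3.0]
q, scale = quantize_channel(channel)
restored = dequantize_channel(q, scale)
print(q, [round(v, 3) for v in restored])
```

Because the scale is chosen per channel, a single outlier only degrades the resolution of its own channel rather than the whole tensor.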


Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis

http://arxiv.org/abs/2406.18817v1

Compressor summary: Key points: - A novel non-rigid point set registration method based on clustering centroids and members - Tikhonov regularization with an $\ell_1$-induced Laplacian kernel for smooth and robust displacement fields - Clustering-improved Nyström method to reduce computational complexity and storage of the Gram matrix - High accuracy, low-dimensionality, and ability to handle large deformations Summary: The paper proposes a new non-rigid point set registration method that uses clustering analysis, Tikhonov regularization with an $\ell_1$ kernel, and a clustering-improved Nyström method to achieve high accuracy, low complexity, and robustness.


Divide, Ensemble and Conquer: The Last Mile on Unsupervised Domain Adaptation for On-Board Semantic Segmentation

http://arxiv.org/abs/2406.18809v1

Compressor summary: DEC is a flexible framework for unsupervised domain adaptation in semantic segmentation that uses synthetic multi-source datasets to improve performance on real-world datasets.


Online Stackelberg Optimization via Nonlinear Control

http://arxiv.org/abs/2406.18805v1

Compressor summary: The text proposes a unified algorithmic framework for minimizing regret in interactive problems with adaptive agents, by casting them as online control problems and analyzing their properties.


All Random Features Representations are Equivalent

http://arxiv.org/abs/2406.18802v1

Compressor summary: Random feature techniques can approximate positive-definite kernels with infinite-dimensional dot products using optimal sampling policies.
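To make the technique tangible, here is a minimal random Fourier features sketch — the classic construction this line of work analyzes; the bandwidth, feature count, and test points below are arbitrary choices, not from the paper:

```python
import math, random

random.seed(0)

def rff_features(x, ws, bs):
    """Random Fourier feature map z(x) whose dot products approximate
    the Gaussian (RBF) kernel exp(-||x - y||^2 / 2)."""
    d = len(ws)
    return [math.sqrt(2.0 / d) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(ws, bs)]

dim, n_feat = 2, 4000
# Frequencies drawn from the kernel's spectral density (standard Gaussian),
# phases drawn uniformly on [0, 2*pi):
ws = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_feat)]
bs = [random.uniform(0, 2 * math.pi) for _ in range(n_feat)]

x, y = [0.3, -0.1], [0.0, 0.4]
approx = sum(a * b for a, b in zip(rff_features(x, ws, bs), rff_features(y, ws, bs)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
print(round(approx, 3), round(exact, 3))  # the two values are close
```

The paper's question is how the choice of sampling distribution for `ws` affects approximation quality; the Monte Carlo error here shrinks roughly as one over the square root of the feature count.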


Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think

http://arxiv.org/abs/2406.18800v1

Compressor summary: Infinite-width NTK models can access richer features than finite models, but their performance is limited by weak optimizers like SGD.