arxiv compressed, 2024-06-18

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-06-18, generated by the compressor, my personal LLM-based project.


LLaNA: Large Language and NeRF Assistant

http://arxiv.org/abs/2406.11840v1

Compressor summary: The paper introduces LLaNA, a NeRF-language assistant that can perform tasks like captioning and Q&A using NeRF weights instead of images or 3D data.


Autoregressive Image Generation without Vector Quantization

http://arxiv.org/abs/2406.11838v1

Compressor summary: The authors propose a diffusion loss function for autoregressive models that allows them to operate in a continuous-valued space without vector quantization, achieving strong image generation results and speed advantages.
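
As a rough illustration of the idea, a per-token diffusion loss can replace cross-entropy over a discrete codebook: each continuous token is corrupted with noise, and a small network conditioned on the autoregressive context learns to predict that noise. A minimal sketch under an assumed corruption schedule and architecture, not the paper's implementation:

```python
import torch
import torch.nn as nn

class DiffusionTokenLoss(nn.Module):
    """Per-token diffusion loss on continuous tokens (simplified sketch).

    Each continuous token x is corrupted with noise at a random level t,
    and a small MLP, conditioned on the autoregressive context vector z,
    learns to predict the noise. Schedule and architecture are assumptions.
    """
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        t = torch.rand(x.shape[0], 1)              # random noise level in (0, 1)
        eps = torch.randn_like(x)
        x_t = (1 - t).sqrt() * x + t.sqrt() * eps  # corrupt the token
        pred = self.net(torch.cat([x_t, z, t], dim=-1))
        return ((pred - eps) ** 2).mean()          # noise-prediction loss

# Toy usage: 8 tokens of width 16, conditioned on 64-dim AR context vectors.
loss_fn = DiffusionTokenLoss(dim=16, cond_dim=64)
loss = loss_fn(torch.randn(8, 16), torch.randn(8, 64))
```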


mDPO: Conditional Preference Optimization for Multimodal Large Language Models

http://arxiv.org/abs/2406.11839v1

Compressor summary: mDPO is a new method to better align multimodal language models by optimizing image preferences and avoiding reward decay, leading to improved performance and reduced hallucination.


Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

http://arxiv.org/abs/2406.11837v1

Compressor summary: The authors propose VQGAN-LC, an image quantization model that scales the codebook to 100,000 entries while maintaining a utilization rate of 99%, improving performance on various tasks.


OoDIS: Anomaly Instance Segmentation Benchmark

http://arxiv.org/abs/2406.11835v1

Compressor summary: The paper introduces a new benchmark for testing anomaly instance segmentation in autonomous vehicles, which is essential for identifying unknown objects like wild animals to prevent accidents.


RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians

http://arxiv.org/abs/2406.11836v1

Compressor summary: The authors propose RetinaGS, a general model parallel training method for 3D Gaussian splatting models, which enables scaling to large numbers of primitives and high resolutions, improving reconstruction quality.


MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

http://arxiv.org/abs/2406.11833v1

Compressor summary: MMDU is a benchmark and dataset to evaluate and improve large vision-language models' abilities in multi-turn and multi-image conversations, which are essential for real-world human-AI interaction applications.


Unveiling Encoder-Free Vision-Language Models

http://arxiv.org/abs/2406.11832v1

Compressor summary: The paper introduces EVE, an encoder-free vision-language model that can rival encoder-based models using only 35M publicly available training samples, by bridging vision-language representation in a unified decoder and enhancing visual recognition with extra supervision.


Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

http://arxiv.org/abs/2406.11831v1

Compressor summary: Key points:
- Large language models (LLMs) are good at text understanding but underused in text-to-image diffusion models
- Misalignment between LLM training and prompt encoding, and the positional bias of decoder-only architectures, are the main obstacles
- A novel framework is proposed to improve text representation and integrate multiple LLMs into the text-to-image generation model
- LI-DiT, a new model based on the framework, achieves state-of-the-art performance in prompt understanding and image generation
Summary: The paper proposes a new framework to enhance the use of large language models in text-to-image diffusion models by improving text representation and fusing multiple LLMs. The resulting LI-DiT model outperforms existing open-source and closed-source models.


Language Modeling with Editable External Knowledge

http://arxiv.org/abs/2406.11830v1

Compressor summary: The paper introduces ERASE, a method that updates language models to reflect changes in the world by deleting or rewriting existing documents when new ones are added, improving accuracy in answering questions about news articles and conversations.


Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

http://arxiv.org/abs/2406.11828v1

Compressor summary: The paper investigates how efficiently a two-layer neural network can learn a target function that is a combination of many nonlinear functions, depending on the number of tasks and the information exponent of each function.


WPO: Enhancing RLHF with Weighted Preference Optimization

http://arxiv.org/abs/2406.11827v1

Compressor summary: Weighted Preference Optimization (WPO) is a novel method to improve reinforcement learning from human feedback by adapting off-policy data to resemble on-policy data, leading to better alignment with human values and enhanced performance.


Spectral Introspection Identifies Group Training Dynamics in Deep Neural Networks for Neuroimaging

http://arxiv.org/abs/2406.11825v1

Compressor summary: The authors propose a new method to analyze deep learning models on neuroimaging data during training, using gradient computations and singular value decomposition, which can help understand and prevent issues like bias and overfitting.


Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation

http://arxiv.org/abs/2406.11824v1

Compressor summary: Infinigen Indoors is a tool that generates photorealistic indoor scenes using procedural generation and constraints, designed to train embodied agents in real-time simulators.


On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

http://arxiv.org/abs/2406.11823v1

Compressor summary: This study optimizes vision-language models for efficient computation while maintaining high performance, and shares the resources on GitHub.


Composing Object Relations and Attributes for Image-Text Matching

http://arxiv.org/abs/2406.11820v1

Compressor summary: The authors propose a fast image-text matching model using scene graphs to represent captions and graph attention networks to learn object-attribute and object-object relations, outperforming existing cross-attention methods in recall and speed.


MegaScenes: Scene-Level View Synthesis at Scale

http://arxiv.org/abs/2406.11819v1

Compressor summary: The paper introduces MegaScenes, a large-scale scene-level dataset for novel view synthesis, addressing challenges like lighting and transient objects in Internet photos.


Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

http://arxiv.org/abs/2406.11817v1

Compressor summary: Iterative length-regularized Direct Preference Optimization (iLR-DPO) improves response quality without increasing verbosity, leading to a strong 7B language model that performs on par with GPT-4.


VideoLLM-online: Online Video Large Language Model for Streaming Video

http://arxiv.org/abs/2406.11816v1

Compressor summary: The LIVE framework enables real-time conversation within a continuous video stream using large multimodal models, improving efficiency and performance for streaming videos.


How Do Large Language Models Acquire Factual Knowledge During Pretraining?

http://arxiv.org/abs/2406.11813v1

Compressor summary: This study investigates how large language models acquire factual knowledge during pretraining and finds that more data does not always help, that acquired knowledge is progressively forgotten, and that larger training batch sizes improve robustness to forgetting.


RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

http://arxiv.org/abs/2406.11811v1

Compressor summary: The authors introduce RepLiQA, a new test dataset for question-answering and topic retrieval tasks, which contains imaginary scenarios not present on the internet to avoid leaking into LLM training sets.


Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics

http://arxiv.org/abs/2406.11810v1

Compressor summary: The paper proposes a computationally efficient reinforcement learning algorithm for linear Bellman-complete settings that injects random noise into least-squares regression problems to perform optimistic value iteration.


Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models

http://arxiv.org/abs/2406.11809v1

Compressor summary: The text describes a physics-constrained learning method that uses data and physical models to predict the nonlinear dynamics of flexible objects, such as soft robots, while also accounting for uncertainty and compositionality.


Faces of Experimental Pain: Transferability of Deep Learned Heat Pain Features to Electrical Pain

http://arxiv.org/abs/2406.11808v1

Compressor summary: Key points:
- Pain recognition using deep learning is challenged by small datasets
- The study investigates the transferability of a heat-pain model to electrical pain
- An existing CNN is used as a feature extractor, and two ML models are trained on its features
- The approach outperforms the baseline in the AI4Pain challenge
Summary: The authors propose a method to recognize electrical pain using a pre-trained heat-pain CNN as a feature extractor, which beats the baseline in the AI4Pain challenge.


Efficient Discovery of Significant Patterns with Few-Shot Resampling

http://arxiv.org/abs/2406.11803v1

Compressor summary: FSR is an efficient algorithm for finding statistically significant patterns in transactional data using resampled datasets with rigorous guarantees on false discoveries.


PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

http://arxiv.org/abs/2406.11802v1

Compressor summary: PhyBench is a new evaluation dataset that tests text-to-image models' physical commonsense abilities by generating images based on prompts with physical principles, revealing their limitations and the need for better reasoning.


Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

http://arxiv.org/abs/2406.11801v1

Compressor summary: Safety Arithmetic is a framework to improve large language models' safety by removing harmful content and aligning them with human values without training.


DataComp-LM: In search of the next generation of training sets for language models

http://arxiv.org/abs/2406.11794v1

Compressor summary: DataComp for Language Models (DCLM) is a testbed to improve language models through controlled dataset experiments, providing a standardized corpus, pretraining recipes, and 53 evaluations, with the baseline DCLM-Baseline model achieving competitive results on MMLU and NLP tasks with less compute.


CELL your Model: Contrastive Explanation Methods for Large Language Models

http://arxiv.org/abs/2406.11785v1

Compressor summary: This paper proposes contrastive explanation methods for generative AI like LLMs, which explain their responses by showing how prompt modifications would change the output.


MDCR: A Dataset for Multi-Document Conditional Reasoning

http://arxiv.org/abs/2406.11784v1

Compressor summary: The text introduces MDCR, a new dataset for evaluating models' ability to answer complex conditional reasoning questions requiring cross-document optimization, which reveals the limitations of current LLMs in solving such tasks.


Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

http://arxiv.org/abs/2406.11780v1

Compressor summary: SPUNGE is a framework that enhances machine unlearning by using data attributes to improve safety and performance of large language models.


Provable Guarantees for Model Performance via Mechanistic Interpretability

http://arxiv.org/abs/2406.11779v1

Compressor summary: The authors propose using mechanistic interpretability techniques to derive and prove formal guarantees on model accuracy and investigate the relationship between proof length, understanding, and bound tightness.


Improving Multi-Agent Debate with Sparse Communication Topology

http://arxiv.org/abs/2406.11776v1

Compressor summary: This paper explores how sparse communication among agents in multi-agent debates can improve language model quality, reduce costs, and extend to other tasks.


Task Me Anything

http://arxiv.org/abs/2406.11775v1

Compressor summary: Task-Me-Anything is a benchmark generation engine that creates tailored multimodal tasks to evaluate large language models' capabilities across various domains.


Optimal Transport-Assisted Risk-Sensitive Q-Learning

http://arxiv.org/abs/2406.11774v1

Compressor summary: The paper introduces a safe reinforcement learning algorithm using optimal transport theory to balance performance and safety in decision-making policies.


Deep Learning methodology for the identification of wood species using high-resolution macroscopic images

http://arxiv.org/abs/2406.11772v1

Compressor summary: The text proposes a new method to identify wood species using high-resolution images and a voting process, and introduces a new data set of macroscopic timber images.


Solving Vision Tasks with Simple Photoreceptors Instead of Cameras

http://arxiv.org/abs/2406.11769v1

Compressor summary: The text discusses how simple visual sensors with few photoreceptors can perform well on computer vision tasks and how a computational algorithm can help optimize their design.


Matching Query Image Against Selected NeRF Feature for Efficient and Scalable Localization

http://arxiv.org/abs/2406.11766v1

Compressor summary: MatLoc-NeRF is a new method for efficient and accurate 3D scene localization using selective NeRF features, pose-aware scene partitioning, and coarse initial pose estimation.


STAR: SocioTechnical Approach to Red Teaming Language Models

http://arxiv.org/abs/2406.11757v1

Compressor summary: STAR is a framework that helps test the safety of large language models by generating instructions for human red teamers, matching demographics for risk assessment, and leveraging diverse viewpoints for label reliability.


A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models

http://arxiv.org/abs/2406.11753v1

Compressor summary: The authors propose a method for determining where to finetune language models based on analyzing their semantic inference process, which improves efficiency and effectiveness over existing baselines.


Domain Generalization for In-Orbit 6D Pose Estimation

http://arxiv.org/abs/2406.11743v1

Compressor summary: Our method trains a neural network to estimate the position and orientation of a spacecraft from monocular images using multi-task learning and data augmentation, improving domain generalization and achieving state-of-the-art results.


Transcendence: Generative Models Can Outperform The Experts That Train Them

http://arxiv.org/abs/2406.11741v1

Compressor summary: The paper explores how generative models can sometimes surpass human experts' abilities when trained on their data, using chess playing as an example and showing that low-temperature sampling enables this transcendence.
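
The low-temperature mechanism is easy to see in isolation: dividing logits by a small temperature concentrates probability mass on the consensus-best action. A minimal sketch with toy logits and an illustrative function name, not code from the paper:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample an action from logits after temperature scaling.

    As temperature -> 0, sampling approaches argmax, concentrating mass on
    the move most agreed upon across the mixture of training experts -- the
    majority-voting effect behind the transcendence result.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# Toy logits over three candidate chess moves (illustrative values).
logits = np.array([2.0, 1.8, 0.5])
print(sample_with_temperature(logits, temperature=1.0))   # diverse play
print(sample_with_temperature(logits, temperature=0.05))  # near-deterministic best move
```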


V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

http://arxiv.org/abs/2406.11739v1

Compressor summary: The V3Det Challenge 2024 is a benchmark for object detection in real-world scenes with various categories and unseen objects, aiming to advance the field and inspire innovation.


InterNeRF: Scaling Radiance Fields via Parameter Interpolation

http://arxiv.org/abs/2406.11737v1

Compressor summary: InterNeRF is a novel architecture that improves NeRFs' scalability to large, real-world scenes by enabling out-of-core training and rendering with a modest increase in training time.


Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models

http://arxiv.org/abs/2406.11736v1

Compressor summary: The paper proposes a neural-symbolic self-training method (ENVISIONS) that leverages environment feedback to improve LLMs' performance in natural language and symbolic domains with limited data.


Correspondence Free Multivector Cloud Registration using Conformal Geometric Algebra

http://arxiv.org/abs/2406.11732v1

Compressor summary: The paper introduces a new method for registering multivector clouds in conformal geometric algebra without solving correspondences, using orthogonal transformations from $SO(4,1)$.


Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

http://arxiv.org/abs/2406.11721v1

Compressor summary: This paper investigates how zero-shot generalization in instruction tuning works and proposes a new data arrangement method to improve it.


Refusal in Language Models Is Mediated by a Single Direction

http://arxiv.org/abs/2406.11717v1

Compressor summary: The study identifies a one-dimensional subspace that mediates refusal behavior in large language models and proposes a method to disable it, revealing the brittleness of current safety fine-tuning methods.
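
The intervention the summary describes amounts to projecting a single unit direction out of the residual-stream activations. A minimal sketch, with shapes and names assumed for illustration:

```python
import torch

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along a single direction.

    Projecting a refusal direction r out of every residual-stream vector x,
    x' = x - (x . r_hat) r_hat, is the kind of one-dimensional intervention
    the summary describes.
    """
    r = direction / direction.norm()
    return activations - (activations @ r).unsqueeze(-1) * r

# Hypothetical shapes: a batch of residual-stream vectors of width 4096.
acts = torch.randn(8, 4096)
refusal_dir = torch.randn(4096)
edited = ablate_direction(acts, refusal_dir)
# The edited activations have (numerically) no component along the direction.
assert torch.allclose(edited @ (refusal_dir / refusal_dir.norm()),
                      torch.zeros(8), atol=1e-3)
```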


Measuring memorization in RLHF for code completion

http://arxiv.org/abs/2406.11715v1

Compressor summary: This paper investigates how reinforcement learning with human feedback (RLHF) affects training data memorization and privacy concerns in code completion models.


Scalable Expressiveness through Preprocessed Graph Perturbations

http://arxiv.org/abs/2406.11714v1

Compressor summary: SE2P is a scalable and configurable graph neural network model that balances expressiveness and generalization by perturbing input graphs and adjusting learnable features.


Latent Denoising Diffusion GAN: Faster sampling, Higher image quality

http://arxiv.org/abs/2406.11713v1

Compressor summary: Latent Denoising Diffusion GAN improves diffusion models' inference speed, image quality, and diversity by using pre-trained autoencoders and a weighted learning strategy.


OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations

http://arxiv.org/abs/2406.11711v1

Compressor summary: OGNI-DC is a novel framework for depth completion that uses Optimization-Guided Neural Iterations to refine depth gradients and generate dense depth maps with high accuracy and generalization.


Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging

http://arxiv.org/abs/2406.11709v1

Compressor summary: TreeInstruct is a state space-based planning algorithm that uses Socratic questioning to help students independently identify and resolve coding errors in a multi-turn interaction setting, outperforming baselines and guiding students efficiently.


Nemotron-4 340B Technical Report

http://arxiv.org/abs/2406.11704v1

Compressor summary: The authors introduce Nemotron-4 340B models, which are competitive language models that can generate synthetic data for training smaller models and are available under a permissive license.


Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies

http://arxiv.org/abs/2406.11703v1

Compressor summary: The text discusses double descent in unsupervised learning using under-complete auto-encoders and its applications for dealing with noisy data, domain shifts, and anomalies.


Meta Reasoning for Large Language Models

http://arxiv.org/abs/2406.11698v1

Compressor summary: Meta-Reasoning Prompting (MRP) is a new method that helps large language models select and apply the best reasoning methods for different tasks, improving their performance and efficiency across various problem domains.


Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

http://arxiv.org/abs/2406.11695v1

Compressor summary: The paper proposes MIPRO, a novel optimizer for language model programs, which improves their performance by crafting task-grounded instructions and navigating credit assignment across modules.


Lightweight Model Pre-training via Language Guided Knowledge Distillation

http://arxiv.org/abs/2406.11689v1

Compressor summary: The paper proposes Language-Guided Distillation (LGD), a new method that uses category names to improve knowledge transfer from a large network to a smaller network for mobile devices.


Tokenization Falling Short: The Curse of Tokenization

http://arxiv.org/abs/2406.11687v1

Compressor summary: The study examines the challenges of tokenization for large language models, its impact on problem solving and resilience to typos, and suggests subword regularization as a potential solution.


The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation

http://arxiv.org/abs/2406.11686v1

Compressor summary: The paper studies offline RL with linear function approximation and gives a fast algorithm that works under low inherent Bellman error, achieving suboptimality that scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$, which is optimal for any algorithm in this setting.


Edge Classification on Graphs: New Directions in Topological Imbalance

http://arxiv.org/abs/2406.11685v1

Compressor summary: The paper proposes TopoEdge, a novel approach to address topological imbalance in edge classification tasks using topological entropy as a metric to measure and mitigate the issue.


HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing

http://arxiv.org/abs/2406.11683v1

Compressor summary: HoLLMwood is a framework that uses large language models to create screenplays by assigning them different roles, such as writer, editor, and actor, mimicking the human creative process.


Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack

http://arxiv.org/abs/2406.11682v1

Compressor summary: Key points:
- The paper proposes a new task, knowledge-to-jailbreak, to test the domain-specific safety of LLMs
- The task involves generating jailbreaks from domain knowledge that can harm LLMs
- The paper collects a large dataset and fine-tunes a model for this task
- Experiments show the effectiveness and generalizability of the method
Summary: The paper introduces a new way to test whether LLMs are safe in different domains by generating domain-knowledge-specific jailbreaks that can harm them, and demonstrates its feasibility with a large dataset and a fine-tuned model.


R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models

http://arxiv.org/abs/2406.11681v1

Compressor summary: The paper introduces R-Eval, a toolkit for evaluating Retrieval-Augmented Large Language Models on various tasks and domains, revealing their effectiveness variations.


Score-fPINN: Fractional Score-Based Physics-Informed Neural Networks for High-Dimensional Fokker-Planck-Levy Equations

http://arxiv.org/abs/2406.11676v1

Compressor summary: The authors propose a novel method using fractional score functions and physics-informed neural networks to solve high-dimensional Fokker-Planck-Lévy equations, overcoming the curse of dimensionality and numerical overflow issues.


BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

http://arxiv.org/abs/2406.11675v1

Compressor summary: The paper proposes BLoB, a method that adapts LLMs' parameters during fine-tuning to improve generalization and uncertainty estimation.


Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference

http://arxiv.org/abs/2406.11674v1

Compressor summary: The paper proposes Endor, a sparse format that compresses pruned LLM weights to reduce weight transfer latency and accelerate offloaded inference on resource-constrained platforms.


Effective Rank Analysis and Regularization for Enhanced 3D Gaussian Splatting

http://arxiv.org/abs/2406.11672v1

Compressor summary: The paper proposes using effective rank analysis and regularization to improve the quality of 3D Gaussian Splatting, a technique for real-time rendering with high-quality 3D reconstruction, by reducing needle-like artifacts and enhancing normal and geometry reconstruction.


Benchmarking of LLM Detection: Comparing Two Competing Approaches

http://arxiv.org/abs/2406.11670v1

Compressor summary: The article summarizes existing LLM text recognition methods, highlights issues with current benchmarking datasets, and introduces a new evaluation dataset to compare different detectors.


"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

http://arxiv.org/abs/2406.11668v1

Compressor summary: BabyBLUE is a new benchmark for evaluating jailbreaks and hallucinations in large language models, improving safety and reliability.


Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'?

http://arxiv.org/abs/2406.11667v1

Compressor summary: The paper shows that it is possible to learn from a weaker oracle than ERM and asks if the ERM principle is necessary for efficient learning.


See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

http://arxiv.org/abs/2406.11665v1

Compressor summary: The text discusses how vision-language models have a Western bias in image understanding and suggests pre-training with diverse languages to improve equity.


Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting

http://arxiv.org/abs/2406.11661v1

Compressor summary: The paper investigates how four LLMs respond to culturally sensitive and neutral prompts on different datasets, finding significant variations in their answers except for GPT-4, questioning the effectiveness of socio-demographic prompting as a method for studying or aligning models with culture.


Can LLM be a Personalized Judge?

http://arxiv.org/abs/2406.11657v1

Compressor summary: The paper explores the reliability of using large language models as personalized judges for user preferences based on personas, and proposes verbal uncertainty estimation to improve their performance.


Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

http://arxiv.org/abs/2406.11654v1

Compressor summary: Ruby Teaming is a method that enhances Rainbow Teaming by adding a memory cache, resulting in higher attack success rate and quality diversity of generated prompts.


A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4

http://arxiv.org/abs/2406.11651v1

Compressor summary: The paper proposes a zero-shot evaluation method for dialogue state tracking using GPT-4, which considers both accuracy and completeness and improves accuracy with manual reasoning paths.


AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection

http://arxiv.org/abs/2406.11643v1

Compressor summary: AnyMaker is a framework for generating general objects with high identity fidelity and flexible text editability using self-supervised models, dual-level ID injection, and ID-aware decoupling.


YOLO-FEDER FusionNet: A Novel Deep Learning Architecture for Drone Detection

http://arxiv.org/abs/2406.11641v1

Compressor summary: YOLO-FEDER FusionNet improves drone detection in complex environments by combining generic object detection and camouflage object detection techniques.


Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions

http://arxiv.org/abs/2406.11640v1

Compressor summary: The paper presents a new polynomial-time algorithm for reinforcement learning with linear function approximation and Bellman completeness, which does not rely on global optimism or solving a nonconvex optimization problem.


MASAI: Modular Architecture for Software-engineering AI Agents

http://arxiv.org/abs/2406.11638v1

Compressor summary: MASAI is a modular architecture for software-engineering AI agents that uses different sub-agents with specialized objectives and strategies, improving performance on complex problems like GitHub issues resolution.


The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance

http://arxiv.org/abs/2406.11634v1

Compressor summary: Using cloze testing on the MMLU dataset, the authors show that large base-rate probability differences among answer tokens (e.g., A/B/C/D) distort measured task performance, and that counterfactual prompting reduces this effect, helping disambiguate test-taking strategies from true benchmark performance.
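
One way to observe such base-rate differences is to read off the probability a model assigns to each answer letter under a content-free context. A hedged sketch using Hugging Face transformers, with gpt2 as a stand-in checkpoint and an illustrative prompt, not the paper's exact protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in checkpoint; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Base-rate probability of each MMLU-style answer letter under a
# content-free context; large spreads here are the distortion at issue.
ids = tok("Answer:", return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]
probs = next_token_logits.softmax(-1)
for letter in ["A", "B", "C", "D"]:
    tid = tok.encode(" " + letter)[0]
    print(letter, float(probs[tid]))
```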


Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation

http://arxiv.org/abs/2406.11632v1

Compressor summary: sMBR decoding uses synthetic sources from backward translation and a reference-free quality estimation metric to improve neural machine translation, outperforming QE reranking and being competitive with standard MBR decoding.
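
For context, generic MBR selection picks the candidate with the highest expected utility against a support set; per the summary, sMBR swaps sampled references for back-translated synthetic sources scored by a reference-free QE metric. A minimal sketch with a toy token-overlap utility standing in for a real metric:

```python
def mbr_decode(candidates, supports, utility):
    """Generic Minimum Bayes Risk selection.

    Picks the candidate with the highest average utility against a support
    set. In standard MBR the supports are sampled reference translations;
    in sMBR (per the summary) they are back-translated synthetic sources
    scored with a reference-free QE metric.
    """
    return max(candidates,
               key=lambda c: sum(utility(c, s) for s in supports) / len(supports))

def overlap_f1(a: str, b: str) -> float:
    """Toy token-overlap F1, standing in for a real utility/QE metric."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    p, r = len(ta & tb) / len(ta), len(ta & tb) / len(tb)
    return 2 * p * r / (p + r) if p + r else 0.0

best = mbr_decode(["the cat sat", "a cat sits"], ["the cat sat down"], overlap_f1)
print(best)  # -> "the cat sat"
```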


DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

http://arxiv.org/abs/2406.11633v1

Compressor summary: DocGenome is a structured document benchmark with four key characteristics, designed to improve understanding and processing of scientific documents by large models.


Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

http://arxiv.org/abs/2406.11629v1

Compressor summary: The paper explores how to use GPT-4 as a judge for evaluating LLMs with fewer biases using many-shot in-context prompts and symbol bias mitigation.


Words in Motion: Representation Engineering for Motion Forecasting

http://arxiv.org/abs/2406.11624v1

Compressor summary: The authors propose a method to control motion forecasting models using natural language inputs, making them more interpretable and easy to interact with.


Building Knowledge-Guided Lexica to Model Cultural Variation

http://arxiv.org/abs/2406.11622v1

Compressor summary: The paper proposes measuring regional cultural variation using language and knowledge-guided lexica, while pointing out limitations in current large language models.


DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling

http://arxiv.org/abs/2406.11617v1

Compressor summary: DELLA is a new model merging technique that uses MAGPRUNE, a novel pruning method, to improve multitasking performance and releases its source code.


Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

http://arxiv.org/abs/2406.11614v1

Compressor summary: The paper introduces ConceptVectors, a dataset and methodology to evaluate unlearning in large language models by measuring changes in parametric knowledge traces of erased concepts, which improves current behavioral-based evaluations.


Long Code Arena: a Set of Benchmarks for Long-Context Code Models

http://arxiv.org/abs/2406.11612v1

Compressor summary: The authors introduce Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context, along with datasets, evaluation tools, and baselines.


Learning Hierarchical Semantic Classification by Grounding on Consistent Image Segmentations

http://arxiv.org/abs/2406.11608v1

Compressor summary: Key points:
- The paper proposes a method for hierarchical semantic classification based on image segmentations
- The method treats each level as a different task and uses a single recognition model with fine-to-coarse parsing
- The method introduces a Tree-path KL Divergence loss to enforce consistency and accuracy across levels
Summary: The paper presents a novel hierarchical semantic classification method that uses image segmentations, treats each level as a different task, and enforces consistency and accuracy with a new loss function.


Standardizing Structural Causal Models

http://arxiv.org/abs/2406.11601v1

Compressor summary: The paper proposes internally-standardized structural causal models (iSCMs), which reduce artifacts in synthetic datasets used for benchmarking causal structure learning algorithms, and are less identifiable from prior knowledge on the weights.


Understanding "Democratization" in NLP and ML Research

http://arxiv.org/abs/2406.11598v1

Compressor summary: The paper examines how the term "democratization" is used in natural language processing and machine learning publications, finding that it often refers to ease of access or use of technologies rather than engaging with theories of democracy.


On GNN explanability with activation rules

http://arxiv.org/abs/2406.11594v1

Compressor summary: Key points:
- GNNs are effective graph machine learning models that need more trustworthiness
- The paper proposes a method to mine activation rules in GNNs' hidden layers to understand how they work
- The method uses information theory and pattern languages to find and redescribe the rules
- The rules can help explain GNN decisions and reveal hidden features
- The method outperforms existing methods in explaining graph classification
Summary: The paper presents a novel method to mine activation rules in GNNs' hidden layers using information theory and pattern languages, which can improve the trustworthiness and explainability of these powerful graph machine learning models.


ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models

http://arxiv.org/abs/2406.11592v1

Compressor summary: The ChildDiffusion framework generates realistic child faces with various attributes and ethnicities, addressing privacy concerns, and can be used for various ML tasks.


Style Transfer with Multi-iteration Preference Optimization

http://arxiv.org/abs/2406.11581v1

Compressor summary: This paper proposes an improved preference optimization approach for text style transfer, which adapts techniques from statistical machine translation and incorporates exploration, contrastive sampling, pseudo-parallel generation, and dynamic weighted reward aggregation.


Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation

http://arxiv.org/abs/2406.11580v1

Compressor summary: The paper introduces Error Span Annotation, a hybrid evaluation method for machine translation that is faster, cheaper, and requires less expertise than existing methods.


Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

http://arxiv.org/abs/2406.11579v1

Compressor summary: Duoduo CLIP is a 3D representation learning model that uses multi-view images and leverages 2D priors from CLIP models for fine-tuning, resulting in better generalization, reduced GPU requirements, faster training time, and improved performance in object retrieval and text-and-shape alignment tasks.


Mathematical Entities: Corpora and Benchmarks

http://arxiv.org/abs/2406.11577v1

Compressor summary: The paper introduces annotated corpora for studying mathematical language and tests various natural language processing models on them, finding that they struggle with math terminology and definitions.


Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

http://arxiv.org/abs/2406.11568v1

Compressor summary: The paper presents an E2E framework that uses large language models to decode invasive brain signals and improve speech neuroprosthesis, showing its potential for BCI applications.


Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting

http://arxiv.org/abs/2406.11567v1

Compressor summary: The paper introduces a Quaternion Generative Adversarial Neural Network model for color image inpainting, which uses quaternion deconvolution and batch normalization to improve stability and take advantage of channel correlations.


MEMLA: Enhancing Multilingual Knowledge Editing with Neuron-Masked Low-Rank Adaptation

http://arxiv.org/abs/2406.11566v1

Compressor summary: This paper introduces MKEB, a multilingual knowledge editing benchmark, and MEMLA, a method that improves multilingual knowledge editing by identifying and modifying knowledge neurons across 12 languages.


Extrinsic Evaluation of Cultural Competence in Large Language Models

http://arxiv.org/abs/2406.11565v1

Compressor summary: The paper evaluates how well language models can generate texts that are culturally sensitive when the prompts change based on different nationalities.


Intersymbolic AI: Interlinking Symbolic AI and Subsymbolic AI

http://arxiv.org/abs/2406.11563v1

Compressor summary: Intersymbolic AI combines symbolic and subsymbolic AI techniques to enhance AI effectiveness, similar to how human thought benefits from both conscious and subconscious processes.


An Imitative Reinforcement Learning Framework for Autonomous Dogfight

http://arxiv.org/abs/2406.11562v1

Compressor summary: The paper proposes a novel reinforcement learning framework that efficiently learns dogfight policies for UCAVs by imitating experts and autonomously exploring dynamic environments.


Input Conditioned Graph Generation for Language Agents

http://arxiv.org/abs/2406.11555v1

Compressor summary: The paper presents a method to create dynamic language agents using graphs and a pretrained LLM fine-tuned with reinforcement learning, which improves communication flow and accuracy in various domains.


Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment

http://arxiv.org/abs/2406.11551v1

Compressor summary: The paper proposes a simple and efficient approach to improve fine-grained sketch-based image retrieval by enhancing feature alignment and sharing mutual information between sketches and images using dual weight-sharing networks, contrastive loss, and a learnable self-attention module.


GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations

http://arxiv.org/abs/2406.11547v1

Compressor summary: The paper introduces GECO, a gender-controlled text dataset, to evaluate the impact of biases on XAI methods for large language models and shows that fine-tuning embedding layers improves explanation performance.


Do Parameters Reveal More than Loss for Membership Inference?

http://arxiv.org/abs/2406.11544v1

Compressor summary: Contrary to prior claims that black-box (loss-only) access suffices for optimal membership inference, white-box access to model parameters reveals more; a new attack based on inverse-Hessian vector products demonstrates this.


Improving Quality Control of Whole Slide Images by Explicit Artifact Augmentation

http://arxiv.org/abs/2406.11538v1

Compressor summary: The text proposes a method to generate artificial artifacts in histopathology images for training classification algorithms, improving artifact detection accuracy.


Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers

http://arxiv.org/abs/2406.11534v1

Compressor summary: Inpainting the Gaps (InG) is a novel evaluation framework that improves the perturbation test by inpainting partial or complete objects in an image, reducing test-time distribution shifts and providing more consistent evaluation scores for popular explanation methods of the Vision Transformer.


Explainable Artificial Intelligence and Multicollinearity : A Mini Review of Current Approaches

http://arxiv.org/abs/2406.11524v1

Compressor summary: This paper reviews current methods for Explainable Artificial Intelligence (XAI) that address the issue of multicollinearity, which occurs when features are highly correlated and can affect the interpretation of informative features in machine learning models.


FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks

http://arxiv.org/abs/2406.11522v1

Compressor summary: The paper proposes FullCert, the first end-to-end certifier that provides robustness guarantees against both training and inference data attacks using a new library called BoundFlow.


HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

http://arxiv.org/abs/2406.11519v1

Compressor summary: HyperSIGMA is a large transformer-based foundation model for hyperspectral image analysis that leverages sparse sampling attention, spectral enhancement, and a novel dataset to achieve state-of-the-art performance on various tasks.


Revisiting Spurious Correlation in Domain Generalization

http://arxiv.org/abs/2406.11517v1

Compressor summary: The authors propose a structural causal model to analyze spurious correlation in representation learning, and introduce a propensity score weighted estimator to control confounding bias for out-of-distribution generalization.


Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs

http://arxiv.org/abs/2406.11514v1

Compressor summary: CFMAD is a framework that uses counterfactual multi-agent debate to override LLMs' biases and improve their performance on natural language processing tasks.


Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection

http://arxiv.org/abs/2406.11507v1

Compressor summary: PNPT is a novel method that uses normal semantics prompting to improve multi-class image anomaly detection by combining prior knowledge with sample characteristics in a dual-stream reconstruction model.


On the Feasibility of Fidelity$^-$ for Graph Pruning

http://arxiv.org/abs/2406.11504v1

Compressor summary: FiP is a framework that uses the fidelity measure to create global masks for graph pruning, and shows that general XAI methods perform better than GNN-specific ones.


GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation

http://arxiv.org/abs/2406.11503v1

Compressor summary: The authors propose a novel pipeline using GPT-4 and GPT-4V to generate geometry problems with aligned text and images, improving the geometric capabilities of multi-modal models on benchmarks.


Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality

http://arxiv.org/abs/2406.11501v1

Compressor summary: The text introduces a novel graphical modeling approach called teleporter theory to analyze counterfactual causality in complex machine learning applications using structural causal models.


CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG

http://arxiv.org/abs/2406.11497v1

Compressor summary: CrAM is a plug-and-play method that adjusts LLMs' attention scores based on document credibility to reduce misinformation in retrieval-augmented generation.


Interventional Imbalanced Multi-Modal Representation Learning via $β$-Generalization Front-Door Criterion

http://arxiv.org/abs/2406.11490v1

Compressor summary: Our paper proposes a causal model for multi-modal representation learning that balances predominant and auxiliary modalities, improving performance and exploration over existing methods.


Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency

http://arxiv.org/abs/2406.11486v1

Compressor summary: The paper explores how well LLMs perform in extracting temporal relations without fine-tuning, finding they struggle and have issues with uniqueness and transitivity, but improving accuracy by solving inconsistencies is not guaranteed.


Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

http://arxiv.org/abs/2406.11481v1

Compressor summary: The text discusses various model-based and model-free approaches for Constrained RL in average reward MDPs, analyzing constraint violation and regret guarantees, assuming ergodic MDPs and considering weakly communicating MDPs.


Vocabulary Expansion for Low-resource Cross-lingual Transfer

http://arxiv.org/abs/2406.11477v1

Compressor summary: The paper explores sample-efficient strategies for adapting large language models to low-resource languages using vocabulary expansion, finding simpler heuristic-based initialization methods more efficient and robust than existing approaches.


How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment

http://arxiv.org/abs/2406.11474v1

Compressor summary: The paper explores how different parts of context text affect In-Context Alignment (ICA) in Large Language Models (LLMs), finding that examples are crucial for alignment, and that ICA performs better than parameter fine-tuning in some tasks.


Promises, Outlooks and Challenges of Diffusion Language Modeling

http://arxiv.org/abs/2406.11473v1

Compressor summary: SEDD is a fast and promising alternative to autoregressive LLMs, but has some weaknesses in conditional generation with short prompts.


Learning from Exemplars for Interactive Image Segmentation

http://arxiv.org/abs/2406.11472v1

Compressor summary: Our interactive segmentation frameworks use transformer backbones, exemplar-informed modules, and cross-attention blocks to refine masks for single or multiple objects in the same category, reducing users' labor and clicks compared to previous methods.


Automating Easy Read Text Segmentation

http://arxiv.org/abs/2406.11464v1

Compressor summary: The paper explores new methods for automatically breaking sentences into smaller parts to help people with reading difficulties, and evaluates their effectiveness in different languages.


Just How Flexible are Neural Networks in Practice?

http://arxiv.org/abs/2406.11463v1

Compressor summary: The study examines how neural networks fit data in practice, finding that optimizers, architecture, and activation functions influence the capacity to fit training data and generalization.


TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation

http://arxiv.org/abs/2406.11460v1

Compressor summary: TRACE is a method for improving multi-hop question answering by constructing knowledge-grounded reasoning chains from retrieved documents.


Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness

http://arxiv.org/abs/2406.11458v1

Compressor summary: Strategic training defends against adversaries by modeling their incentives rather than assuming worst-case attacks, avoiding the performance cost of standard adversarial robustness, and uses uncertainty over those incentives to guide learning.


Calibrating Where It Matters: Constrained Temperature Scaling

http://arxiv.org/abs/2406.11456v1

Compressor summary: The text discusses how to improve the performance of convolutional neural networks for medical image analysis by adjusting their calibration in regions that affect decision making.


Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction

http://arxiv.org/abs/2406.11455v1

Compressor summary: Key points:
- The paper proposes a two-stage multi-step method to improve LLMs' performance on complex information extraction tasks
- The method uses an RL framework and a DDQN algorithm to learn the optimal order for sequential entity extraction
- The method is evaluated on multiple public datasets and shows effectiveness
Summary: The paper introduces a novel method that uses reinforcement learning and deep Q-networks to teach large language models how to extract information from complex sentences in an optimal sequence, improving their performance on information extraction tasks.


MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

http://arxiv.org/abs/2406.11451v1

Compressor summary: The paper introduces MedThink, a method that mimics human cognition to create fine-grained instruction pairs for LVLMs in medical image report generation tasks, improving their performance and reducing hallucinations.


Solving the Inverse Problem of Electrocardiography for Cardiac Digital Twins: A Survey

http://arxiv.org/abs/2406.11445v1

Compressor summary: The paper reviews methods to solve the ECG inverse problem, which helps create personalized virtual heart models that can improve cardiology care.


PrAViC: Probabilistic Adaptation Framework for Real-Time Video Classification

http://arxiv.org/abs/2406.11443v1

Compressor summary: Key points:
- Video processing has two categories: whole video and real-time processing
- Real-time processing aims to identify critical situations quickly
- The paper proposes a novel framework for online classification of video data
- The framework uses a mathematical function to encourage faster decisions
- The framework adapts offline models to online and recurrent operations
- The framework outperforms non-online methods in accuracy and speed
Summary: The paper presents a new framework for online video processing that adapts offline models, encourages fast decisions with a mathematical function, and achieves better accuracy and speed than non-online methods.


SWCF-Net: Similarity-weighted Convolution and Local-global Fusion for Efficient Large-scale Point Cloud Semantic Segmentation

http://arxiv.org/abs/2406.11441v1

Compressor summary: Key points:
- The paper proposes SWCF-Net, which combines local and global features for efficient segmentation of large-scale point clouds.
- It uses Similarity-Weighted Convolution (SWConv) to enhance local feature extraction and reduce attention module complexity.
- It fuses local and global features with orthogonal components to eliminate redundant information.
- It evaluates SWCF-Net on SemanticKITTI and Toronto3D datasets and shows competitive results with less computational cost.
Summary: The paper presents SWCF-Net, a network that efficiently segments large-scale point clouds by combining local and global features using SWConv, attention module downsampling, and feature fusion. It achieves competitive performance on SemanticKITTI and Toronto3D with low computation.


Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks

http://arxiv.org/abs/2406.11437v1

Compressor summary: The paper explores how to use deep learning methods to predict execution time from source code, proposing a new dual-transformer model that performs better than existing tree-based neural networks and graph neural networks.


AnyTrans: Translate AnyText in the Image with Large Scale Models

http://arxiv.org/abs/2406.11432v1

Compressor summary: AnyTrans is a framework that translates fragmented texts and fuses them seamlessly into images using large-scale models without training.


Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

http://arxiv.org/abs/2406.11431v1

Compressor summary: The text discusses the weak-to-strong deception issue in superalignment, where strong language models may deceive weak ones to gain higher rewards, and suggests using intermediate models as a potential solution.


A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

http://arxiv.org/abs/2406.11430v1

Compressor summary: The paper proposes a method to compress the Key-Value (KV) cache in large language models by using the $L_2$ norm of key embeddings, which can significantly reduce memory requirements while maintaining accuracy.
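
A minimal sketch of what such a norm-based retention rule could look like; the summary does not specify the selection direction, so keeping the lowest-norm keys (which reportedly attract high attention) is an explicit assumption here:

```python
import torch

def compress_kv_cache(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float = 0.5):
    """Keep cache entries whose key embeddings have the smallest L2 norms.

    keys, values: [seq_len, head_dim]. Keeping the *lowest*-norm keys is an
    assumption of this sketch; the summary only says the L2 norm of key
    embeddings drives the compression.
    """
    n_keep = max(1, int(keys.shape[0] * keep_ratio))
    norms = keys.norm(dim=-1)                                 # [seq_len]
    idx = norms.topk(n_keep, largest=False).indices.sort().values
    return keys[idx], values[idx]

k, v = torch.randn(1024, 128), torch.randn(1024, 128)
k_small, v_small = compress_kv_cache(k, v, keep_ratio=0.25)   # 4x cache memory saving
```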


Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction

http://arxiv.org/abs/2406.11429v1

Compressor summary: The authors propose a fast method for extracting relations from text using virtual entity matching and coarse-grained recall, which improves inference speed and accuracy in zero-shot relation extraction tasks.


Cross-domain Open-world Discovery

http://arxiv.org/abs/2406.11422v1

Compressor summary: CROW is a prototype-based method that discovers novel classes and assigns samples to seen and unseen classes under domain shifts using a cluster-then-match strategy and a fine-tuned representation space.


BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM

http://arxiv.org/abs/2406.11418v1

Compressor summary: The paper presents BAMBINO-LM, a continual pretraining strategy for BabyLM that improves Italian language capability by combining alternation and PPO-based modeling, mimicking human children's bilingual learning process.


HARE: HumAn pRiors, a key to small language model Efficiency

http://arxiv.org/abs/2406.11410v1

Compressor summary: Key points:
- Human priors are important for data construction in deep learning
- Large Language Models (LLMs) tend to neglect human priors and rely on large-scale data scraping
- The paper proposes a principle to leverage human priors for data construction and trains an SLM named HARE-1.1B
- HARE-1.1B performs well against state-of-the-art SLMs and shows the effectiveness of the proposed principle
Summary: The paper introduces a principle to use human priors for data construction in small language models (SLMs), which improves their performance and efficiency, and demonstrates it with an SLM named HARE-1.1B.


CodeGemma: Open Code Models Based on Gemma

http://arxiv.org/abs/2406.11409v1

Compressor summary: CodeGemma is a set of specialized open code models that excel at various code and natural language tasks, including mathematical reasoning and code completion.


Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

http://arxiv.org/abs/2406.11403v1

Compressor summary: The report introduces Multimodal Structured Generation, a framework that uses frozen MMFMs to generate structured outputs for document understanding tasks, achieving competitive results in the 2nd Multimodal Foundation Models Challenge.


Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

http://arxiv.org/abs/2406.11402v1

Compressor summary: This paper analyzes the semantic correctness of 10 open, smaller language models across different aspects and shows that they can compete with or outperform state-of-the-art models in specific use-cases.


Large Language Models and Knowledge Graphs for Astronomical Entity Disambiguation

http://arxiv.org/abs/2406.11400v1

Compressor summary: The paper shows how to use large language models and knowledge graph clustering to extract and disambiguate entities from astronomical text.


DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting

http://arxiv.org/abs/2406.11397v1

Compressor summary: DistPred is a new approach for regression and forecasting that uses proper scoring rules to train a model to estimate the uncertainty of the response variable efficiently and accurately.
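
A standard proper scoring rule that fits this sample-based setting is the CRPS, estimable directly from draws of the predictive distribution via CRPS ≈ E|X − y| − ½·E|X − X′|. Whether DistPred uses CRPS specifically is an assumption of this sketch:

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| for X, X' ~ F.

    CRPS is a proper scoring rule computable from predictive samples alone,
    matching the distribution-free, sample-based setting described.
    """
    pairwise = np.abs(samples[:, None] - samples[None, :]).mean()
    return np.abs(samples - y).mean() - 0.5 * pairwise

draws = np.random.normal(2.0, 1.0, size=256)  # a model's predictive samples
print(crps_from_samples(draws, y=1.7))        # lower means sharper + better calibrated
```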


P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models

http://arxiv.org/abs/2406.11391v1

Compressor summary: The paper proposes using proximal policy optimization (PPO) to combine Generative Adversarial Networks (GANs) and Large Language Models (LLMs) for improving tabular data augmentation, achieving better accuracy in models trained on synthetic data.


SEFraud: Graph-based Self-Explainable Fraud Detection via Interpretative Mask Learning

http://arxiv.org/abs/2406.11389v1

Compressor summary: SEFraud is a novel graph-based fraud detection framework that simultaneously detects fraud and provides interpretable results using customized heterogeneous graph transformer networks and learnable feature masks.


MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

http://arxiv.org/abs/2406.11385v1

Compressor summary: MetaGPT is a data-agnostic method that uses model exclusive task arithmetic to merge GPT-scale models, improving their performance across diverse tasks without compromising data privacy or computational efficiency.


Understanding Multi-Granularity for Open-Vocabulary Part Segmentation

http://arxiv.org/abs/2406.11384v1

Compressor summary: PartCLIPSeg is a new method for segmenting fine-grained entities in images using generalized parts, object contexts, and attention control techniques, achieving state-of-the-art results on several benchmarks.


A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

http://arxiv.org/abs/2406.11380v1

Compressor summary: The authors evaluate a language model's ability to attribute quotes in novels and show that its performance depends on the level of book memorization, but it can still perform well on unseen books.


Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?

http://arxiv.org/abs/2406.11375v1

Compressor summary: Analogical reasoning helps humans and AI understand new concepts by associating them with familiar ones, and using teacher and student language models can enhance this process in practical settings.


Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments

http://arxiv.org/abs/2406.11370v1

Compressor summary: ZEPO is a framework that improves the fairness and alignment of large language models' evaluations with human judgments by optimizing prompts without labeled data.


Improving Quotation Attribution with Fictional Character Embeddings

http://arxiv.org/abs/2406.11368v1

Compressor summary: The authors propose a method to improve automatic quotation attribution in literary works by using character embeddings trained on a new corpus of drama plays.


Refiner: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities

http://arxiv.org/abs/2406.11357v1

Compressor summary: Refiner is a method that extracts and restructures relevant information from documents to improve the performance of Retrieval-Augmented Generation models in answering questions.


Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach

http://arxiv.org/abs/2406.11354v1

Compressor summary: Tree Generation (TG) is a self-decompression method that helps Large Language Models and Multimodal Large Language Models avoid catastrophic forgetting and maintain performance on language benchmarks by decompressing knowledge into the training corpus and synthetically generating fine-tuning data.


Full-ECE: A Metric For Token-level Calibration on Large Language Models

http://arxiv.org/abs/2406.11345v1

Compressor summary: Full-ECE is a new metric that measures how well large language models are calibrated across their entire token-level probability distributions, rather than only their top predictions.
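
As a point of reference for what calibration over the entire distribution means, here is a sketch that scores every (token, class) probability against its 0/1 outcome, generalizing classical top-1 ECE; the paper's exact estimator may differ, and the binning scheme below is an assumption.

```python
import numpy as np

def full_distribution_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 20) -> float:
    """ECE computed over every (token, class) probability, not just argmax.

    probs:  (n_tokens, vocab) predicted next-token distributions
    labels: (n_tokens,)       ground-truth token ids
    """
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(labels)), labels] = 1.0
    p, y = probs.ravel(), onehot.ravel()
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # weight each bin by its share of probabilities, compare
            # average confidence against average empirical frequency
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece
```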


A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

http://arxiv.org/abs/2406.11341v1

Compressor summary: The paper investigates how chain-of-thought reasoning, in-context learning, and supervised fine-tuning affect syllogistic reasoning in large language models, revealing cognitive heuristics and improvement in inference quality.


CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition

http://arxiv.org/abs/2406.11340v1

Compressor summary: The paper proposes a method (CM2-Net) to improve driver action recognition by continually learning from different modalities using accumulative cross-modal mapping prompts, which help extract and prioritize features across modalities.


Fine-grained Controllable Text Generation through In-context Learning with Feedback

http://arxiv.org/abs/2406.11338v1

Compressor summary: The paper proposes an in-context learning method with feedback for rewriting sentences to satisfy nontrivial linguistic features, such as dependency depth, and shows it works well even with sparse data.


Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment

http://arxiv.org/abs/2406.11334v1

Compressor summary: The paper introduces a new benchmark for testing multimodal models on real-world tasks that require spatial planning, basic programming, and logical reasoning using XLogoOnline visual programming environment.


Hallucination Mitigation Prompts Long-term Video Understanding

http://arxiv.org/abs/2406.11333v1

Compressor summary: This paper proposes a pipeline to mitigate hallucination in long video understanding tasks using frame sampling, question injection, and chain-of-thought learning, achieving state-of-the-art results on the MovieChat dataset.


They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

http://arxiv.org/abs/2406.11331v1

Compressor summary: The authors propose a method to generate counterfactual images to reduce biases in vision language models like CLIP by fine-tuning them with diverse synthetic images of humans in context.


Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams

http://arxiv.org/abs/2406.11328v1

Compressor summary: EMPEC is a large-scale Chinese healthcare knowledge benchmark for evaluating the performance of Large Language Models across various professions and specialized fields.


ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding

http://arxiv.org/abs/2406.11327v1

Compressor summary: ClawMachine is a method that encodes visual referential information without syntax and allows multimodal large language models to communicate better between language and vision.


Low-power Ship Detection in Satellite Images Using Neuromorphic Hardware

http://arxiv.org/abs/2406.11319v1

Compressor summary: The text describes a low-power, two-stage system for maritime ship detection on satellite images that uses a binary classifier followed by an object detection model, achieving high performance and energy efficiency.


GUICourse: From General Vision Language Models to Versatile GUI Agents

http://arxiv.org/abs/2406.11317v1

Compressor summary: GUICourse is a dataset suite that trains visual-based GUI agents from general VLMs, improving their OCR, grounding, and GUI knowledge for better performance on common GUI tasks.


Temporal Lidar Depth Completion

http://arxiv.org/abs/2406.11315v1

Compressor summary: The paper proposes a recurrent method to improve depth completion from sparse lidar measurements using camera guidance and achieves state-of-the-art results on the KITTI dataset with low overhead.


Semi-Supervised Domain Adaptation Using Target-Oriented Domain Augmentation for 3D Object Detection

http://arxiv.org/abs/2406.11313v1

Compressor summary: TODA is a new semi-supervised domain adaptation method for LiDAR-based 3D object detection that uses mixing and adversarial augmentation to improve feature alignment and performance across different domains.


Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object Detection

http://arxiv.org/abs/2406.11311v1

Compressor summary: OHDA framework improves indoor 3D object detection by aligning synthetic and real-world data using object-aware augmentation and a two-branch adaptation system.


BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

http://arxiv.org/abs/2406.11309v1

Compressor summary: BaFTA is a novel backpropagation-free method for adapting vision-language models at test time, using online clustering and Rényi entropy instead of fine-tuning text prompts or tuning a learning rate.


Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?

http://arxiv.org/abs/2406.11308v1

Compressor summary: Key points:
- Data-driven model for optimal rework policies in lot-based manufacturing systems with optional rework steps
- Causal model to estimate yield improvement using double/debiased machine learning techniques
- Validated on real data from white LED production, achieving 2-3% yield improvement
Summary: The paper proposes a causal model using machine learning to optimize rework decisions in lot-based manufacturing systems and demonstrates its effectiveness on white LED production.
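
The double/debiased machine learning step above can be illustrated with a cross-fitted partialling-out estimator of the rework effect; the learners, fold count, and variable names are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import KFold

def dml_rework_effect(X, d, y, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of a binary
    treatment d (rework: 0/1) on outcome y (yield), given lot features X.
    """
    res_y = np.zeros(len(y), dtype=float)
    res_d = np.zeros(len(d), dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        # nuisance models: outcome regression and treatment propensity
        m_y = RandomForestRegressor(random_state=0).fit(X[train], y[train])
        m_d = RandomForestClassifier(random_state=0).fit(X[train], d[train])
        res_y[test] = y[test] - m_y.predict(X[test])
        res_d[test] = d[test] - m_d.predict_proba(X[test])[:, 1]
    # regress outcome residuals on treatment residuals
    return float(res_d @ res_y / (res_d @ res_d))
```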


An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers

http://arxiv.org/abs/2406.11307v1

Compressor summary: This paper compares two model compression techniques, low-rank factorization and Monarch factorization, and shows that low-rank factorization performs better on text classification tasks.
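
The low-rank option in this comparison is the classic truncated-SVD recipe: approximate a weight matrix W by a rank-r product BA and re-express the layer as two smaller linear maps. A minimal PyTorch sketch, with rank selection and fine-tuning details left to the paper:

```python
import torch

def low_rank_factorize(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    """Replace an nn.Linear with a rank-`rank` factorization via truncated SVD."""
    W = linear.weight.data                                   # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    root = S[:rank].sqrt()
    A = torch.diag(root) @ Vh[:rank]                         # (rank, in)
    B = U[:, :rank] @ torch.diag(root)                       # (out, rank)
    down = torch.nn.Linear(W.shape[1], rank, bias=False)
    up = torch.nn.Linear(rank, W.shape[0], bias=True)
    down.weight.data = A
    up.weight.data = B
    up.bias.data = (linear.bias.data.clone() if linear.bias is not None
                    else torch.zeros(W.shape[0]))
    return torch.nn.Sequential(down, up)                     # x -> B(Ax) + bias
```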


VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

http://arxiv.org/abs/2406.11303v1

Compressor summary: VideoVista is a video question answering benchmark that covers diverse content, durations, and tasks to evaluate large multimodal models' performance in video understanding and reasoning.


Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs

http://arxiv.org/abs/2406.11301v1

Compressor summary: The paper proposes a data augmentation technique to create diverse instruction variants for training and evaluating Large Language Models' ability to follow complex instructions accurately.


A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models

http://arxiv.org/abs/2406.11289v1

Compressor summary: The survey reviews text summarization research, highlighting paradigm shifts from traditional methods to deep neural networks, pre-trained language models, and large language models.


MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

http://arxiv.org/abs/2406.11288v1

Compressor summary: MFC-Bench is a benchmark for evaluating the accuracy of large vision-language models in multimodal fact-checking across three tasks and reveals their limitations.


Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding

http://arxiv.org/abs/2406.11283v1

Compressor summary: GRL generates diverse synthetic scenes for self-supervised 3D representation learning and achieves better performance on downstream tasks like 3D object detection and semantic segmentation.


From Pixels to Progress: Generating Road Network from Satellite Imagery for Socioeconomic Insights in Impoverished Areas

http://arxiv.org/abs/2406.11282v1

Compressor summary: The paper proposes a system to extract road networks from satellite images in impoverished areas, improving data availability and showing positive impacts on socioeconomic development.


i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment

http://arxiv.org/abs/2406.11280v1

Compressor summary: i-SRT is a method that uses self-retrospection to improve textual-visual alignment, reduce verbosity, and enhance content relevance in video question answering tasks.


Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs

http://arxiv.org/abs/2406.11278v1

Compressor summary: LARS is a new scoring function for Uncertainty Estimation in Large Language Models that uses supervised data to capture complex dependencies and produces more reliable and calibrated response scores.


Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

http://arxiv.org/abs/2406.11277v1

Compressor summary: The paper presents HaluAgent, a framework that enables smaller language models to detect hallucination types in text, code, and mathematical expressions, achieving performance comparable to or higher than GPT-4 without tool enhancements.


Self-training Large Language Models through Knowledge Detection

http://arxiv.org/abs/2406.11275v1

Compressor summary: This paper proposes a self-training method for large language models that improves performance and reduces data requirements by using reference-free consistency checks.


Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

http://arxiv.org/abs/2406.11274v1

Compressor summary: The paper proposes Skip-Layer Attention to improve Transformers by allowing direct attention between non-adjacent layers, enhancing their ability to capture dependencies and perform better in language modeling tasks.
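
A minimal sketch of the idea, assuming the simplest instantiation: queries from the current layer attend to a cached, non-adjacent earlier layer's hidden states. The paper's exact formulation, gating, and placement may differ, and the class name is mine.

```python
import torch
import torch.nn as nn

class SkipLayerAttention(nn.Module):
    """Cross-layer attention: Q from the current layer, K/V from an
    earlier (non-adjacent) layer, letting deep layers access detailed
    early features directly instead of only through adjacent layers."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_current: torch.Tensor, h_skip: torch.Tensor) -> torch.Tensor:
        # h_current: (batch, seq, d_model) from layer l
        # h_skip:    (batch, seq, d_model) cached from layer l - k
        out, _ = self.attn(h_current, h_skip, h_skip)
        return self.norm(h_current + out)
```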


Development of an Adaptive Multi-Domain Artificial Intelligence System Built using Machine Learning and Expert Systems Technologies

http://arxiv.org/abs/2406.11272v1

Compressor summary: The paper proposes a hybrid AI system that combines expert systems, gradient descent, and generative AI to learn reasoning pathways in new problem domains, as a small step towards creating an artificial general intelligence.


MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

http://arxiv.org/abs/2406.11271v1

Compressor summary: MINT-1T is a large, diverse, open-source multimodal interleaved dataset that includes text and images from various sources, such as PDFs and research papers, to train advanced models.


Mitigating Large Language Model Hallucination with Faithful Finetuning

http://arxiv.org/abs/2406.11267v1

Compressor summary: The paper introduces Faithful Finetuning (F2), a method to improve the accuracy of question answering in large language models by explicitly modeling the process of faithful question answering during fine-tuning.


DRIP: Discriminative Rotation-Invariant Pole Landmark Descriptor for 3D LiDAR Localization

http://arxiv.org/abs/2406.11266v1

Compressor summary: The paper proposes a new approach to improve 3D LiDAR-based robot self-localization by enhancing the discriminability of pole-like landmarks and using a novel rotation-invariant convolutional neural network and unsupervised learning for feature extraction and compression.


The Fall of ROME: Understanding the Collapse of LLMs in Model Editing

http://arxiv.org/abs/2406.11263v1

Compressor summary: The paper analyzes why LLMs collapse after ROME edits and proposes a solution to prevent it by using prefixed keys consistently.


Generative Visual Instruction Tuning

http://arxiv.org/abs/2406.11262v1

Compressor summary: GenLLaVA is a multimodal model that combines language, image, and text generation using instruction finetuning to improve zero-shot capabilities for visual understanding tasks.


Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection

http://arxiv.org/abs/2406.11260v1

Compressor summary: The study introduces AdStyle, a method to train fake news detectors that can resist style-conversion attacks using diverse and coherent attack prompts generated by large language models.


NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation

http://arxiv.org/abs/2406.11259v1

Compressor summary: The proposed Neural Light Dynamic Fields model generates high-quality 3D talking faces 30 times faster than NeRF by representing light fields with light segments and using knowledge distillation and active pool training.


Enhancing Biomedical Knowledge Retrieval-Augmented Generation with Self-Rewarding Tree Search and Proximal Policy Optimization

http://arxiv.org/abs/2406.11258v1

Compressor summary: SeRTS is a novel method that combines Monte Carlo Tree Search and self-rewarding to improve large language models' performance in retrieval-augmented generation for biomedical question answering.


ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking

http://arxiv.org/abs/2406.11257v1

Compressor summary: The paper proposes a novel framework called ExCP that reduces the storage of training checkpoints for large language models by compressing residuals, weights, and momentum using non-uniform quantization.


Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

http://arxiv.org/abs/2406.11256v1

Compressor summary: The paper proposes a dynamic data mixture method for MoE instruction tuning, which adjusts the sampling weights of different tasks based on inter-task redundancy to improve model performance.


Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space

http://arxiv.org/abs/2406.11253v1

Compressor summary: Key points:
- Introduces Holistic-Motion2D, a large 2D human motion dataset with pose and text annotations
- Proposes Tender, a method for 2D text-driven whole-body motion generation using attention and confidence modeling
- Shows that 2D motion can generate expressive, diverse, and realistic motions and has potential for 3D lifting
Summary: The paper presents Holistic-Motion2D, a novel 2D human motion benchmark with rich annotations, and Tender, a text-driven 2D motion generation method. It demonstrates the effectiveness and utility of 2D motion for various applications and 3D motion transfer.


Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning

http://arxiv.org/abs/2406.11252v1

Compressor summary: The paper proposes using open semantics as anchors to transition from image-anchor relationships to image-target relationships for CLIP-based few-shot classification, improving its performance.


Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

http://arxiv.org/abs/2406.11250v1

Compressor summary: The paper evaluates language models' emotional and empathic comprehension, finding empathy difficult to model with NLP approaches due to its subjective nature, the dynamics of human interaction, and low agreement among annotators.


Relational Learning in Pre-Trained Models: A Theory from Hypergraph Recovery Perspective

http://arxiv.org/abs/2406.11249v1

Compressor summary: The paper proposes a hypergraph recovery model to study how foundation models acquire relational understanding during pre-training and applies it to entity alignment in multimodal learning.


STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

http://arxiv.org/abs/2406.11247v1

Compressor summary: Key points:
- Explores building embodied agents with a large language model (LLM) in Minecraft
- The STEVE Series agents can perform various tasks efficiently and creatively
- The agents are enhanced with vision, action code, Critic, memory, and multi-agent features
- Also looks into pruning the agent system through knowledge distillation
Summary: The paper presents the STEVE Series, LLM-based embodied agents in Minecraft that can efficiently and creatively perform various tasks, and explores their enhancements and pruning methods.


Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks

http://arxiv.org/abs/2406.11245v1

Compressor summary: The paper proposes a RIS-assisted IoV network that formulates resource allocation as a Markov decision process (MDP) and solves it with soft actor-critic (SAC) to optimize V2I and V2V link performance in terms of Age of Information (AoI) and payload transmission probability.


SpoT-Mamba: Learning Long-Range Dependency on Spatio-Temporal Graphs with Selective State Spaces

http://arxiv.org/abs/2406.11244v1

Compressor summary: SpoT-Mamba is a new framework that uses node-specific walk sequences and temporal scans to capture long-range spatio-temporal dependencies for better STG forecasting.


FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

http://arxiv.org/abs/2406.11243v1

Compressor summary: FamiCom is a measure that combines familiarity and complexity to estimate language models' end-task performance more reliably than perplexity or other existing metrics.


Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion

http://arxiv.org/abs/2406.11242v1

Compressor summary: Key points:
- Novel method for image and pixel retrieval using hypergraphs and community selection
- Overcomes limitations of traditional diffusion methods
- Achieves state-of-the-art accuracy and speed on two datasets
Summary: The paper proposes a new method that uses hypergraphs and community selection to enhance image and pixel retrieval, outperforming existing techniques in accuracy and speed.


Evading AI-Generated Content Detectors using Homoglyphs

http://arxiv.org/abs/2406.11239v1

Compressor summary: Homoglyph-based attacks can effectively evade existing large language model detectors, raising concerns about their reliability in combating misinformation and academic cheating.
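
The attack pattern is simple to illustrate: swapping Latin characters for visually identical Unicode codepoints leaves the text unchanged to a human reader while changing the byte and token stream a detector sees. The mapping and swap rate below are illustrative, not the paper's; the comments also note why naive normalization is not a defense.

```python
import unicodedata

# A few Latin -> Cyrillic lookalikes of the kind such attacks rely on.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}

def swap_homoglyphs(text: str, every: int = 3) -> str:
    """Replace every n-th swappable character with its visual twin."""
    return "".join(
        HOMOGLYPHS.get(ch, ch) if i % every == 0 else ch
        for i, ch in enumerate(text)
    )

def nfkc(text: str) -> str:
    # NFKC normalization alone does NOT undo confusable substitutions
    # (Cyrillic 'а' stays Cyrillic); a real defense needs a confusables
    # mapping, e.g. along the lines of Unicode TR39, before detection.
    return unicodedata.normalize("NFKC", text)
```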


What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling

http://arxiv.org/abs/2406.11238v1

Compressor summary: The paper analyzes how large language models use long contexts for language modeling, finding that content words, initial tokens, frequent n-grams, and prior knowledge benefit predictions, while overconfidence may cause distant probabilities to increase.


QTIP: Quantization with Trellises and Incoherence Processing

http://arxiv.org/abs/2406.11235v1

Compressor summary: QTIP is a new post-training quantization (PTQ) method that uses trellis-coded quantization (TCQ) to quantize LLM weights in high dimensions, achieving better quality and faster inference than vector-quantization-based methods.


MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction

http://arxiv.org/abs/2406.11234v1

Compressor summary: The study proposes a simple and efficient method to improve sentiment triplet extraction by integrating minimal tagging and token-level contrastive learning, showing comparable or better results than existing approaches.


Probing the Decision Boundaries of In-context Learning in Large Language Models

http://arxiv.org/abs/2406.11233v1

Compressor summary: The paper studies how to improve the ability of large language models to learn from few examples in new tasks, by analyzing and modifying their decision boundaries for binary classification.


Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

http://arxiv.org/abs/2406.11230v1

Compressor summary: MMNeedle is a benchmark for testing multimodal large language models' ability to locate target sub-images within sets of images based on textual instructions and descriptions, revealing the performance gap between API-based and open-source models and GPT-4o's strength in long-context scenarios.


ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark

http://arxiv.org/abs/2406.11228v1

Compressor summary: ComperDial is a new benchmark for evaluating dialogue systems that uses human-scored responses from various agents and dialogues, and introduces a novel metric (CPDScore) that better correlates with human judgments.


Building another Spanish dictionary, this time with GPT-4

http://arxiv.org/abs/2406.11218v1

Compressor summary: The authors introduce Spanish-BFF-2, an updated AI-generated Spanish dictionary using GPT-4-turbo, and compare it with the previous version.


WeatherQA: Can Multimodal Language Models Reason about Severe Weather?

http://arxiv.org/abs/2406.11217v1

Compressor summary: WeatherQA is a multimodal dataset for evaluating large language models' ability to forecast severe weather events using images and texts, with the aim of improving meteorological predictions and public safety.


Global Data Constraints: Ethical and Effectiveness Challenges in Large Language Model

http://arxiv.org/abs/2406.11214v1

Compressor summary: The paper discusses challenges in acquiring diverse and high-quality training data for large language models, which can lead to biased or unreliable content, and proposes strategies to improve data quality and model performance while respecting ethical standards.


Zero-Shot Scene Change Detection

http://arxiv.org/abs/2406.11210v1

Compressor summary: The authors propose a novel method for scene change detection that uses tracking models without training and addresses both content and style gaps between input images, improving performance especially on unseen domains.


Retraining with Predicted Hard Labels Provably Increases Model Accuracy

http://arxiv.org/abs/2406.11206v1

Compressor summary: The paper analyzes how retraining models on their own predicted hard labels can provably improve accuracy when the given labels are noisy, and applies this idea to improve privacy in training neural networks.
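
The retraining mechanism itself is a two-step recipe; this sklearn sketch shows the loop on a generic classifier. The paper's contribution is the analysis of when this provably helps, which the sketch does not capture, and the choice of learner here is arbitrary.

```python
from sklearn.linear_model import LogisticRegression

def retrain_on_hard_labels(X, y_noisy):
    """Fit on the given (possibly noisy) labels, relabel the training set
    with the model's own hard (argmax) predictions, then fit again."""
    first = LogisticRegression(max_iter=1000).fit(X, y_noisy)
    y_hard = first.predict(X)          # predicted hard labels
    return LogisticRegression(max_iter=1000).fit(X, y_hard)
```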


Consistency^2: Consistent and Fast 3D Painting with Latent Consistency Models

http://arxiv.org/abs/2406.11202v1

Compressor summary: The paper introduces a Latent Consistency Model adapted for 3D Painting, which improves generation speed and quality using techniques from 2D generative imaging.


Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

http://arxiv.org/abs/2406.11201v1

Compressor summary: This text discusses how fine-tuning large language models (LLMs) for Retrieval-Augmented Generation (RAG) systems may not always improve performance, especially in complex query scenarios.


AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval

http://arxiv.org/abs/2406.11200v1

Compressor summary: AvaTaR is a framework that helps large language models use external tools and knowledge better by providing optimized prompts based on reasoning between positive and negative examples.


Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

http://arxiv.org/abs/2406.11196v1

Compressor summary: The paper proposes Vid3D, a model that generates 3D videos by first creating a 2D seed and then generating independent 3D representations for each timestep, and shows it achieves comparable results to existing methods without explicitly modeling 3D temporal dynamics.


In-Context Editing: Learning Knowledge from Self-Induced Distributions

http://arxiv.org/abs/2406.11194v1

Compressor summary: ICE is a novel approach for language models to edit knowledge in context without overfitting or losing performance.


MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model

http://arxiv.org/abs/2406.11193v1

Compressor summary: The study explores how multimodal large language models rely on domain-specific neurons when handling projected image features for visual tasks like VQA.


Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

http://arxiv.org/abs/2406.11192v1

Compressor summary: B2NERD is a dataset that improves Large Language Models' performance on Open NER by standardizing entity definitions and reducing data redundancy.


A Survey on Human Preference Learning for Large Language Models

http://arxiv.org/abs/2406.11191v1

Compressor summary: This survey reviews how human feedback is used to improve large language models' (LLMs) applicability and effectiveness by learning their preferences, and evaluates different approaches to align LLMs with human intentions.


Aligning Large Language Models from Self-Reference AI Feedback with one General Principle

http://arxiv.org/abs/2406.11190v1

Compressor summary: The proposed framework helps large language models give better feedback using self-reference and general principles, improving AI's understanding of human intentions and preferences.


Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation

http://arxiv.org/abs/2406.11189v1

Compressor summary: WeCLIP uses the frozen CLIP model as a backbone to extract semantic features and generate pseudo labels, then refines them with a lightweight decoder and a refinement module, achieving better performance in weakly supervised semantic segmentation.


Learning Iterative Reasoning through Energy Diffusion

http://arxiv.org/abs/2406.11179v1

Compressor summary: IRED is a novel framework for learning to reason on various tasks using energy-based optimization that adapts to problem difficulty and uses annealed energy landscapes and score function supervision for faster training and inference.


TIFG: Text-Informed Feature Generation with Large Language Models

http://arxiv.org/abs/2406.11177v1

Compressor summary: Text-Informed Feature Generation (TIFG) is a novel framework that uses external knowledge to generate new explainable features for data mining and feature engineering, improving downstream task performance.


Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

http://arxiv.org/abs/2406.11176v1

Compressor summary: The paper introduces a new framework called Iterative step-level Process Refinement (IPR) that improves agent performance by providing detailed guidance during training using step-level rewards and contrastive action pairs.


BSRBF-KAN: A combination of B-splines and Radial Basic Functions in Kolmogorov-Arnold Networks

http://arxiv.org/abs/2406.11173v1

Compressor summary: The paper presents BSRBF-KAN, a Kolmogorov-Arnold Network that combines B-splines and radial basis functions, performs well on the MNIST dataset, and opens the door to combining more mathematical functions in KAN design.
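
To give a flavor of basis-function combination in a KAN-style layer, here is a simplified sketch that mixes a Gaussian RBF expansion with a SiLU base path. BSRBF-KAN additionally includes a B-spline basis, omitted here for brevity, and all hyperparameters and names below are assumptions.

```python
import torch
import torch.nn as nn

class RBFBasisLayer(nn.Module):
    """Each input feature is expanded over fixed Gaussian bumps and
    combined linearly, alongside a SiLU 'base' path, in the spirit of
    KAN layers that learn edge activation functions via basis weights."""

    def __init__(self, in_dim, out_dim, n_centers=8, lo=-2.0, hi=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(lo, hi, n_centers))
        self.gamma = (n_centers / (hi - lo)) ** 2          # bump sharpness
        self.rbf_w = nn.Linear(in_dim * n_centers, out_dim, bias=False)
        self.base_w = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                  # x: (batch, in_dim)
        phi = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        return self.base_w(nn.functional.silu(x)) + self.rbf_w(phi.flatten(1))
```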


Enhancing Criminal Case Matching through Diverse Legal Factors

http://arxiv.org/abs/2406.11172v1

Compressor summary: The paper proposes a framework that uses multi-task learning and de-redundancy to extract and fuse diverse legal factors for criminal case matching, outperforming existing methods.


SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

http://arxiv.org/abs/2406.11171v1

Compressor summary: The paper introduces SUGARCREPE++, a dataset for analyzing how well vision-language and language models distinguish semantic and lexical alterations in image captions, finding that current models struggle with this task.


How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation

http://arxiv.org/abs/2406.11162v1

Compressor summary: The paper creates low-resource relation extraction datasets in 10 languages by translating English datasets and filtering out low-quality data, then tests LLMs on them.