arxiv compressed, 2024-07-23

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-23, generated by the compressor, my personal LLM-based project.


AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

http://arxiv.org/abs/2407.15850v1

Compressor summary: The paper presents AutoAD-Zero, a method to generate audio descriptions for movies and TV series using visual and textual cues without training, which outperforms some fine-tuned models.


BoostMVSNeRFs: Boosting MVS-based NeRFs to Generalizable View Synthesis in Large-scale Scenes

http://arxiv.org/abs/2407.15848v1

Compressor summary: The paper introduces BoostMVSNeRFs, a method to enhance MVS-based NeRF rendering quality in large-scale scenes by selecting and combining multiple cost volumes during volume rendering without additional training.


Reconstructing Training Data From Real World Models Trained with Transfer Learning

http://arxiv.org/abs/2407.15845v1

Compressor summary: The paper proposes a novel method for reconstructing data from trained classifiers in realistic settings, enabling data reconstruction beyond visual data and improving privacy risk awareness.


CarFormer: Self-Driving with Learned Object-Centric Representations

http://arxiv.org/abs/2407.15843v1

Compressor summary: The paper proposes object-centric slot representations in bird's-eye view for self-driving, which outperform other approaches in driving tasks and in forecasting future scenes.


HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning

http://arxiv.org/abs/2407.15844v1

Compressor summary: The paper proposes an end-to-end method for predicting 3D hand meshes from RGB images, preserving contextual and scale information, and shows its effectiveness in experiments.


Artist: Aesthetically Controllable Text-Driven Stylization without Training

http://arxiv.org/abs/2407.15842v1

Compressor summary: Artist is a training-free method that controls content and style generation in diffusion models for text-driven stylization by separating denoising processes and suppressing irrelevant content.


SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

http://arxiv.org/abs/2407.15841v1

Compressor summary: SF-LLaVA is a training-free video large language model that combines spatial and temporal features from video frames using a SlowFast design, achieving high performance on various video tasks.


MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

http://arxiv.org/abs/2407.15838v1

Compressor summary: MMInstruct is a diverse visual instruction tuning dataset that improves the performance of Vision Large Language Models by addressing existing limitations in instruction quality and diversity.


Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

http://arxiv.org/abs/2407.15837v1

Compressor summary: Latent MIM combines masked image modeling (MIM) with latent space reconstruction to learn high-level visual representations from unlabeled images, addressing challenges like representation collapse and region correlation.


dMel: Speech Tokenization made Simple

http://arxiv.org/abs/2407.15835v1

Compressor summary: The paper introduces dMel, a simple speech tokenization method that outperforms existing methods on speech recognition and synthesis tasks.


J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

http://arxiv.org/abs/2407.15828v1

Compressor summary: This study introduces a new, large-scale, spontaneous, and acoustically clean spoken dialogue corpus for human-AI interactions called J-CHAT and demonstrates its effectiveness in improving dialogue generation models.


On shallow planning under partial observability

http://arxiv.org/abs/2407.15820v1

Compressor summary: This paper explores how different discount factors affect reinforcement learning performance and suggests using shorter horizons for partially observable environments.
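
A quick numerical note on the connection (my own illustration, not from the paper): discounting weights a reward k steps ahead by gamma^k, so a discount factor gamma implies an effective planning horizon of roughly 1/(1-gamma), and lowering gamma is one way to enforce the shorter horizons recommended for partially observable settings.

```python
# Rule-of-thumb illustration (not code from the paper): the effective
# planning horizon implied by a discount factor gamma is ~ 1 / (1 - gamma),
# since sum_k gamma^k = 1 / (1 - gamma).
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    print(f"gamma={gamma}: effective horizon ~ {horizon:.0f} steps")
```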


Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

http://arxiv.org/abs/2407.15819v1

Compressor summary: Chain-of-Sight is a module that accelerates MLLMs' pre-training by efficiently using visual details and extending visual tokens, reducing training time without sacrificing performance.


Efficient and generalizable prediction of molecular alterations in multiple cancer cohorts using H&E whole slide images

http://arxiv.org/abs/2407.15816v1

Compressor summary: The authors propose a multi-task algorithm for predicting multiple DNA alterations from H&E images, which could help prioritize samples for molecular testing and improve the detection of rare mutations.


Perceptions of Linguistic Uncertainty by Language Models and Humans

http://arxiv.org/abs/2407.15814v1

Compressor summary: The paper investigates how well language models can interpret uncertainty expressions and map them to probabilities, finding that most models perform similarly to humans but are more biased by prior knowledge.


Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

http://arxiv.org/abs/2407.15811v1

Compressor summary: The authors propose a low-cost method for training text-to-image generative models using deferred masking, synthetic images, and improved transformer architecture, achieving competitive results with much lower computational and financial costs than existing approaches.


Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems

http://arxiv.org/abs/2407.15810v1

Compressor summary: The authors introduce a new face dataset for adversarial audits and robust FRS training, highlight disparities in gender prediction accuracy across Global North and South, and propose low-resource bias mitigation techniques using few-shot and contrastive learning.


FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones

http://arxiv.org/abs/2407.15806v1

Compressor summary: The paper introduces FSboard, the largest fingerspelling recognition dataset for American Sign Language, collected from Deaf signers using mobile cameras, and presents a baseline model achieving 11.1% CER.


Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation

http://arxiv.org/abs/2407.15798v1

Compressor summary: The EMC framework enhances the generation of contextually appropriate and diverse listener facial reactions based on the speaker's multimodal behaviour, emotional context, and robustness to missing modalities.


AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

http://arxiv.org/abs/2407.15795v1

Compressor summary: AdaCLIP is a vision-language model that uses learnable prompts to detect anomalies in images from unseen categories, achieving better performance than other methods.


CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

http://arxiv.org/abs/2407.15793v1

Compressor summary: The paper proposes a novel approach called Continual Generative training for Incremental prompt-Learning that uses generative replay to adapt Vision-Language Models to new tasks while preserving zero-shot capabilities.


Robust Mixture Learning when Outliers Overwhelm Small Groups

http://arxiv.org/abs/2407.15792v1

Compressor summary: The paper proposes an algorithm that estimates the means of mixtures with outliers, achieving optimal error guarantees with minimal overhead and leveraging mixture structure when possible.


Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach

http://arxiv.org/abs/2407.15788v1

Compressor summary: The paper introduces a system using Large Language Models to extract company tickers, sentiment analysis, and summaries from raw financial news without relying on pre-structured data feeds.


Unsupervised Mastoidectomy for Cochlear CT Mesh Reconstruction Using Highly Noisy Data

http://arxiv.org/abs/2407.15787v1

Compressor summary: The authors propose an unsupervised learning method to synthesize mastoidectomy volumes from preoperative CT scans for cochlear implant intraoperative navigation.


Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels

http://arxiv.org/abs/2407.15786v1

Compressor summary: LICORICE is a novel RL algorithm that learns interpretable policies using few human labels and concept ensembles.


Explaining Decisions in ML Models: a Parameterized Complexity Analysis

http://arxiv.org/abs/2407.15780v1

Compressor summary: The paper studies how hard it is to explain different ML models with transparent mechanisms, focusing on abductive and contrastive problems.


STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay

http://arxiv.org/abs/2407.15773v1

Compressor summary: The paper proposes a new test-time adaptation method called STAMP, which uses a stable memory bank to improve recognition and outlier rejection for both known and unknown classes during inference.


Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis

http://arxiv.org/abs/2407.15763v1

Compressor summary: The paper proposes a method for open-world object detection that uses self-supervised learning and virtual outlier synthesis to detect anomalous objects without relying on class labels, achieving state-of-the-art results on various datasets.


Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning

http://arxiv.org/abs/2407.15762v1

Compressor summary: CLP is a general framework for finetuning language models on multiple objectives, enabling them to learn from feedback and balance conflicting goals like creativity and safety.


Model editing for distribution shifts in uranium oxide morphological analysis

http://arxiv.org/abs/2407.15756v1

Compressor summary: Model editing helps deep learning models generalize better to distribution shifts in uranium ore concentrate classification.


LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

http://arxiv.org/abs/2407.15754v1

Compressor summary: LongVideoBench is a new question-answering benchmark for multimodal models that tests their ability to understand and reason over long videos with subtitles, posing challenges even for advanced proprietary models.


Parallel Split Learning with Global Sampling

http://arxiv.org/abs/2407.15738v1

Compressor summary: The paper proposes UGS and LDS, two methods that optimize the mini-batch sampling process in distributed deep learning systems, improving model accuracy under non-IID data and reducing training time in the presence of stragglers.


OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context

http://arxiv.org/abs/2407.15736v1

Compressor summary: OMoS-QA is a dataset for immigration counseling that contains questions, documents, and answers in German and English, used to compare five pretrained LLMs on extractive question answering, finding favorable trade-offs between precision and recall.


TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON

http://arxiv.org/abs/2407.15734v1

Compressor summary: TaskGen is a framework that uses agents and tasks to solve various problems with high success rates, by managing information efficiently and reducing verbosity.


Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders

http://arxiv.org/abs/2407.15731v1

Compressor summary: The IIMM measures how much a vision-language model will learn or forget after fine-tuning and can help predict performance gains and losses.


Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning

http://arxiv.org/abs/2407.15724v1

Compressor summary: Maximizing dataset diversity measured by $A$ (generalized entropy) improves image classification performance in deep learning, especially for medical imaging.
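
The summary doesn't spell out how $A$ is defined; as a hedged sketch, a generalized-entropy (Hill-number) diversity over class counts, the standard family this kind of metric belongs to, looks like the following. The paper's actual $A$ may be defined differently.

```python
import numpy as np

def hill_diversity(counts, q=1.0):
    """Generalized-entropy (Hill-number) diversity of order q.

    A sketch of the metric family the summary alludes to; the paper's
    metric A may differ in its exact definition.
    """
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(np.exp(-(p * np.log(p)).sum()))  # exp(Shannon entropy)
    return float((p ** q).sum() ** (1.0 / (1.0 - q)))

# A balanced dataset scores as more diverse than a skewed one of equal size:
print(hill_diversity([250, 250, 250, 250]))  # 4.0
print(hill_diversity([970, 10, 10, 10]))     # ~1.2
```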


DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design

http://arxiv.org/abs/2407.15723v1

Compressor summary: The authors propose a new approach to generate floorplans with numerical constraints using data structures and evaluate it with a new dataset and benchmarks.


Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability

http://arxiv.org/abs/2407.15720v1

Compressor summary: The study investigates how large language models perform on various composite tasks, finding that they handle simpler tasks well but struggle with complex ones that require multiple steps of reasoning.


GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

http://arxiv.org/abs/2407.15719v1

Compressor summary: The authors propose GFE-Mamba, a classifier that uses generative feature extraction to integrate multimodal data from assessment scales, MRI, and PET for accurate prediction of Alzheimer's disease progression in patients with mild cognitive impairment.


AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

http://arxiv.org/abs/2407.15711v1

Compressor summary: AssistantBench is a benchmark for testing language agents on complex web tasks, showing their limitations and introducing a new agent called SeePlanAct that performs better than others.


SwinSF: Image Reconstruction from Spatial-Temporal Spike Streams

http://arxiv.org/abs/2407.15708v1

Compressor summary: Swin Spikeformer is a novel model for dynamic scene reconstruction from spike streams that uses shifted window self-attention and temporal spike attention to achieve state-of-the-art performance in high-speed imaging with spike cameras.


Predicting the Best of N Visual Trackers

http://arxiv.org/abs/2407.15707v1

Compressor summary: The paper proposes a meta-tracker that predicts the best visual tracker for a given video sequence based on initial frames and outperforms existing trackers on various benchmarks.


Estimating Probability Densities with Transformer and Denoising Diffusion

http://arxiv.org/abs/2407.15703v1

Compressor summary: The paper proposes a Transformer+Denoising Diffusion model that estimates full probability densities for regression problems in science, using astronomical observations as an example.


Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)

http://arxiv.org/abs/2407.15694v1

Compressor summary: The paper investigates AI-generated text detection for Hindi, evaluating 26 LLMs, introducing a new dataset ($AG_{hi}$), and proposing a detectability index ($ADI_{hi}$).


HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

http://arxiv.org/abs/2407.15680v1

Compressor summary: HaloQuest is a new dataset for testing vision-language models on multimodal hallucination using synthetic images and real ones, helping to improve their reliability and performance.


Problems in AI, their roots in philosophy, and implications for science and society

http://arxiv.org/abs/2407.15671v1

Compressor summary: The paper argues for more attention to philosophical aspects of AI technology and its use, and criticizes common theories of knowledge that influence current AI practices.


SLVideo: A Sign Language Video Moment Retrieval Framework

http://arxiv.org/abs/2407.15668v1

Compressor summary: SLVideo is a software framework that recognizes both hand and facial signs in sign language videos, enabling search and the creation of a thesaurus of similar videos.


TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly

http://arxiv.org/abs/2407.15648v1

Compressor summary: The paper proposes a class-agnostic tree-transformer framework that predicts sequential assembly actions for 3D objects with primitive bricks from multi-view images using synthetic-to-real transfer learning and action-to-silhouette projection.


SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection

http://arxiv.org/abs/2407.15646v1

Compressor summary: The study evaluates the effect of Gaussian blur on image sharpness and object detection performance in automotive simulation using Virtual KITTI dataset.


Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models

http://arxiv.org/abs/2407.15645v1

Compressor summary: Psychometric alignment is a metric that measures how well language models reflect human knowledge distribution and can be used to improve their performance in various domains.


Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

http://arxiv.org/abs/2407.15642v1

Compressor summary: Cinemo is a novel image animation method that improves motion controllability, temporal consistency, and smoothness by learning motion residuals, using a structural similarity index, and refining noise with discrete cosine transformation.


Reinforcement Learning Meets Visual Odometry

http://arxiv.org/abs/2407.15626v1

Compressor summary: The paper proposes a Reinforcement Learning framework for Visual Odometry, which adapts dynamically based on real-time conditions and reduces reliance on heuristic design choices.


RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation

http://arxiv.org/abs/2407.15621v1

Compressor summary: RadioRAG is an end-to-end framework that uses real-time online data to improve the accuracy of LLMs in answering radiology-specific questions, showing consistent improvements across various models.


Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

http://arxiv.org/abs/2407.15613v1

Compressor summary: The paper proposes a network to extract and align multi-view semantic concepts from documents and images for better document-based zero-shot learning.


Can GPT-4 learn to analyze moves in research article abstracts?

http://arxiv.org/abs/2407.15612v1

Compressor summary: The paper proposes using GPT-4 to automatically identify moves in written discourse by creating prompts informed by linguistic expertise.


Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets

http://arxiv.org/abs/2407.15611v1

Compressor summary: The paper introduces DMC, a feature selection method for high-dimensional small datasets that considers both feature values and the distribution of observations in the response variable, and uses GAwAR to find the best combination of features for binary classification.


StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation

http://arxiv.org/abs/2407.15608v1

Compressor summary: StylusAI is a new architecture that uses diffusion models to blend handwriting styles between English and German, improving legibility and diversity while outperforming existing models on two datasets.


Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models

http://arxiv.org/abs/2407.15605v1

Compressor summary: The paper evaluates how perspective changes affect foundation models' performance in recognizing fine-grained human activities, comparing different architectures and strategies for handling temporal information.


Discrete Flow Matching

http://arxiv.org/abs/2407.15595v1

Compressor summary: Discrete Flow Matching is a novel generative model for discrete data that improves quality and efficiency compared to previous methods.


Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information

http://arxiv.org/abs/2407.15593v1

Compressor summary: The paper proposes a data-driven active localization method with viewpoint selection and self-supervised training that outperforms existing methods and can be integrated into real-world robotics applications, along with an open-source implementation.


All rivers run into the sea: Unified Modality Brain-like Emotional Central Mechanism

http://arxiv.org/abs/2407.15590v1

Compressor summary: UMBEnet is a brain-like network that processes human emotions using multiple sensory modalities, achieving state-of-the-art results in facial expression recognition.


Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

http://arxiv.org/abs/2407.15589v1

Compressor summary: This paper studies object-centric representations for visual question answering and compares them with foundation models on synthetic and real data.


Unsupervised Robust Cross-Lingual Entity Alignment via Joint Modeling of Entity and Relation Texts

http://arxiv.org/abs/2407.15588v1

Compressor summary: ERAlign is an unsupervised cross-lingual entity alignment framework that uses semantic textual features and a verification process to achieve near-perfect alignment despite noisy data.


Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

http://arxiv.org/abs/2407.15580v1

Compressor summary: aMCL is a learning method that combines simulated annealing with MCL to improve prediction diversity and avoid suboptimal local minima in ambiguous tasks.
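
A rough sketch of the general mechanism (assuming a squared-error task; the paper's actual loss and annealing schedule may differ): replace the hard winner-takes-all assignment with a temperature-controlled soft assignment and anneal the temperature toward zero.

```python
import torch

def annealed_wta_loss(preds, target, temperature):
    """Soft winner-takes-all over multiple hypotheses (illustrative only).

    preds: list of hypothesis tensors, each the same shape as target.
    As temperature -> 0 this recovers hard winner-takes-all; annealing the
    temperature from high to low is the simulated-annealing idea the
    summary describes.
    """
    losses = torch.stack([((p - target) ** 2).mean() for p in preds])
    weights = torch.softmax(-losses / temperature, dim=0).detach()
    return (weights * losses).sum()
```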


An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought

http://arxiv.org/abs/2407.15569v1

Compressor summary: The paper explores how RAFT, a method that combines chain-of-thought with supervised fine-tuning and retrieval augmented generation, improves the performance and reasoning abilities of generative dialogue models across various tasks and languages.


Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval

http://arxiv.org/abs/2407.15566v1

Compressor summary: HAP-VR is a framework to improve video retrieval by addressing challenges in similarity measure and loss estimation, achieving better performance than existing methods on benchmark datasets.


SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning

http://arxiv.org/abs/2407.15556v1

Compressor summary: The paper introduces SETTP, a method for effective style transfer in low-resource scenarios, which learns and transfers source style prompts and uses instance-level prompts to reduce semantic bias.


Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping

http://arxiv.org/abs/2407.15554v1

Compressor summary: DNMap is a storage-efficient method for large-scale 3D neural mapping that uses discrete representations, component vectors, and low-resolution continuous embeddings.


Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

http://arxiv.org/abs/2407.15549v1

Compressor summary: The text discusses how targeted latent adversarial training (LAT) can improve the robustness of large language models (LLMs) to various undesirable behaviors, such as jailbreaking, backdoors, and unlearning specific knowledge.


Inverted Activations

http://arxiv.org/abs/2407.15545v1

Compressor summary: The paper proposes a memory-efficient method for neural network training by saving output tensors instead of input tensors in pointwise nonlinearity layers, which improves performance without sacrificing accuracy.
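
The trick works because some pointwise nonlinearities have derivatives recoverable from their output alone, so autograd never needs the input. A minimal PyTorch sketch of that idea for ReLU (my illustration of the general trick, not the paper's implementation):

```python
import torch

class OutputSavingReLU(torch.autograd.Function):
    """ReLU that saves its *output* for backward instead of its input."""

    @staticmethod
    def forward(ctx, x):
        y = torch.relu(x)
        ctx.save_for_backward(y)  # save the output, not the input
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out * (y > 0)  # dReLU/dx is recoverable from the output

x = torch.randn(4, requires_grad=True)
OutputSavingReLU.apply(x).sum().backward()
```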


Differentiable Product Quantization for Memory Efficient Camera Relocalization

http://arxiv.org/abs/2407.15540v1

Compressor summary: The authors propose a lightweight auto-encoder network that compresses 3D maps while preserving descriptor matching performance for camera relocalization.


Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints

http://arxiv.org/abs/2407.15537v1

Compressor summary: EPO is a new method for Constrained Reinforcement Learning that uses adaptive penalties generated by a Penalty Metric Network to balance policy performance and constraint satisfaction efficiently.


Double Deep Learning-based Event Data Coding and Classification

http://arxiv.org/abs/2407.15531v1

Compressor summary: The paper proposes a new double deep learning-based architecture that efficiently codes event data from event cameras and performs classification using point cloud representations, achieving similar performance to original data even with lossy compression.


Interpretable Concept-Based Memory Reasoning

http://arxiv.org/abs/2407.15527v1

Compressor summary: The paper proposes a new deep learning model that incorporates human-understandable concepts and enables users to verify its decision-making process before deployment.


Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks

http://arxiv.org/abs/2407.15526v1

Compressor summary: The paper introduces Knowledge Recycling, a pipeline that optimizes synthetic data generation and use for training classifiers, improving their quality, usefulness, and privacy properties.


Multiple importance sampling for stochastic gradient estimation

http://arxiv.org/abs/2407.15525v1

Compressor summary: The paper presents a framework for efficient importance sampling of mini-batch samples for gradient estimation from multiple distributions, which adapts to noisy gradients and vector-valued gradients, leading to faster training convergence.
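
A hedged sketch of the core estimator the framework builds on, balance-heuristic multiple importance sampling with equal sample counts per proposal (the paper's adaptive scheme is more involved):

```python
def mis_estimate(f, samplers, pdfs, n_per):
    """Balance-heuristic MIS estimate of the integral of f (sketch only).

    samplers: list of callables, each drawing one sample.
    pdfs: matching list of density functions.
    Under the balance heuristic each sample x contributes
    f(x) / sum_k(n_k * p_k(x)), summed over all samples.
    """
    total = 0.0
    for draw in samplers:
        for _ in range(n_per):
            x = draw()
            denom = sum(n_per * p(x) for p in pdfs)
            total += f(x) / denom
    return total
```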


Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

http://arxiv.org/abs/2407.15516v1

Compressor summary: The study explores how dropping MLP and attention layers from LLMs at inference time can speed up their performance with minimal loss in accuracy.
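
A minimal sketch of the general recipe, assuming a model that exposes its transformer blocks as an nn.ModuleList (the paper's criterion for which layers to drop may differ):

```python
import torch.nn as nn

def drop_blocks(blocks: nn.ModuleList, drop_idx: set) -> nn.ModuleList:
    """Remove selected transformer blocks before inference (sketch).

    Because each block is residual (x + f(x)), skipping f still yields a
    usable, cheaper network; which indices can be dropped with minimal
    accuracy loss is the empirical question the paper studies.
    """
    return nn.ModuleList(b for i, b in enumerate(blocks) if i not in drop_idx)

# e.g. model.layers = drop_blocks(model.layers, {20, 21, 22})
# ("model.layers" is a hypothetical attribute name)
```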


Increasing the Robustness of Model Predictions to Missing Sensors in Earth Observation

http://arxiv.org/abs/2407.15512v1

Compressor summary: The paper proposes two new methods for handling missing data in multi-sensor machine learning models for Earth observation and evaluates their effectiveness using experiments on three datasets.


Algebraic anti-unification

http://arxiv.org/abs/2407.15510v1

Compressor summary: The text discusses abstraction as a key element for AI, introduces the concept of anti-unification, and proposes an algebraic approach to this process based on recent applications in similarity and analogy.


Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

http://arxiv.org/abs/2407.15508v1

Compressor summary: Quantized Large Language Models (LLMs) can achieve excellent performance while reducing resource consumption using innovative methods based on Learnable Singular-value Increment (LSI).


SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time

http://arxiv.org/abs/2407.15507v1

Compressor summary: The paper presents a new approach for high-resolution panorama generation with diffusion models that improves computational efficiency and image quality by shifting non-overlapping denoising windows over time instead of generating and averaging multiple overlapping predictions.



Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models

http://arxiv.org/abs/2407.15504v1

Compressor summary: The paper proposes a framework for compressing prompts for large language models, shows that query-aware compression is crucial, and introduces a new method to improve the performance of existing schemes.


WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation

http://arxiv.org/abs/2407.15502v1

Compressor summary: The paper introduces Web Rendering Parameters Generation (WebRPG), a new task that automates web page visualization based on their HTML code, and develops a dataset, baseline models, and evaluation methods for it.


TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping

http://arxiv.org/abs/2407.15500v1

Compressor summary: TextureCrop improves synthetic image detection accuracy by focusing on high-frequency parts of images where generation artifacts are more common, outperforming center cropping and resizing methods in detecting harmful AI-generated content.
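
A hedged sketch of the idea: score sliding windows by a simple high-frequency proxy (Laplacian variance here; the paper's texture measure may differ) and keep the richest crop instead of the center one.

```python
import numpy as np
from scipy.ndimage import laplace

def texture_crop(img, size=224, stride=112):
    """Return the window with the most high-frequency content (sketch)."""
    gray = img.mean(axis=2) if img.ndim == 3 else img
    best, best_score = None, -np.inf
    h, w = gray.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            score = laplace(gray[y:y + size, x:x + size]).var()
            if score > best_score:
                best, best_score = img[y:y + size, x:x + size], score
    return best
```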


Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction

http://arxiv.org/abs/2407.15498v1

Compressor summary: The paper proposes a corpus refining strategy for Chinese Spelling Correction that combines data from two augmentation methods and filters noisy data to improve accuracy and calibration.


Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

http://arxiv.org/abs/2407.15489v1

Compressor summary: This paper compares multilingual pretraining objectives in a controlled way, finding that the model architecture and multilingual translation are key factors for success.


DiffX: Guide Your Layout to Cross-Modal Generative Modeling

http://arxiv.org/abs/2407.15488v1

Compressor summary: The paper introduces DiffX, a novel diffusion model that generates cross-modal images (RGB+X) guided by layouts and text descriptions using a modality-shared latent space and gated cross-attention.


In-Context Learning Improves Compositional Understanding of Vision-Language Models

http://arxiv.org/abs/2407.15487v1

Compressor summary: This paper studies why vision-language models struggle with compositional image understanding and proposes a method, in-context learning, to improve their performance on complex reasoning tasks.


Diverse Image Harmonization

http://arxiv.org/abs/2407.15481v1

Compressor summary: The proposed method adjusts foreground illumination in composite images using ground-truth reflectance guidance and diverse reflectance generation, producing multiple harmonized results.


Affordance Labeling and Exploration: A Manifold-Based Approach

http://arxiv.org/abs/2407.15479v1

Compressor summary: The paper explores how to use pre-trained networks for recognizing object affordances without modifying them and tests two methods that achieve high accuracy.


MODRL-TA:A Multi-Objective Deep Reinforcement Learning Framework for Traffic Allocation in E-Commerce Search

http://arxiv.org/abs/2407.15476v1

Compressor summary: The paper proposes a multi-objective deep reinforcement learning framework for traffic allocation in e-commerce platforms that balances multiple objectives and handles long-term value and cold start issues.


Learning deep illumination-robust features from multispectral filter array images

http://arxiv.org/abs/2407.15472v1

Compressor summary: The paper presents a novel approach for learning discriminant, illumination-robust features directly from raw multispectral filter array images using three techniques (raw spectral constancy, MSFA-preserving transformations, and raw-mixing), improving classification performance at lower computational cost than existing methods.


Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval

http://arxiv.org/abs/2407.15459v1

Compressor summary: The Text-to-Battery Recipe (T2BR) protocol uses natural language processing to automatically extract recipes for LiFePO4 batteries from research papers, providing valuable insights and accelerating innovation.


GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs

http://arxiv.org/abs/2407.15452v1

Compressor summary: GraphScale improves scalability and efficiency of graph neural networks and node representation learning on large graphs by separating data storage and computation in a distributed setting.


SIGMA:Sinkhorn-Guided Masked Video Modeling

http://arxiv.org/abs/2407.15447v1

Compressor summary: SIGMA is a new method for pretraining videos that uses optimal transport to generate semantic and temporal features, leading to better video representations than existing methods.


Text2Place: Affordance-aware Text Guided Human Placement

http://arxiv.org/abs/2407.15446v1

Compressor summary: The paper proposes a method to realistically insert humans into various scenes using semantic masks and subject-conditioned inpainting, achieving high realism and preserving background and identity.


Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned

http://arxiv.org/abs/2407.15441v1

Compressor summary: The paper presents a system to detect and fix hallucination errors in LLMs using named entity recognition (NER), natural language inference (NLI), sentence boundary detection (SBD), and a decision tree, with a rewriting mechanism that balances precision, speed, and cost.


Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

http://arxiv.org/abs/2407.15439v1

Compressor summary: The paper studies how to fairly choose among options in situations where receiving feedback is delayed or correlated with rewards, such as crowdsourcing and online advertising.


Enhancement of 3D Gaussian Splatting using Raw Mesh for Photorealistic Recreation of Architectures

http://arxiv.org/abs/2407.15435v1

Compressor summary: The paper proposes a method to use raw 3D models to enhance the visual quality and accuracy of 3D architectural scenes reconstructed using 3D Gaussian Splatting, a mainstream technology in the industry.


YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images

http://arxiv.org/abs/2407.15427v1

Compressor summary: The paper presents a novel end-to-end method using YOLOv5 and multiscale modules to improve PCB defect detection accuracy, generalization, and real-time performance, outperforming existing methods.


Empirical Capacity Model for Self-Attention Neural Networks

http://arxiv.org/abs/2407.15425v1

Compressor summary: The paper studies how much information large transformer models can memorize and generalize using common training algorithms and synthetic data, and proposes a model to design task-specific models with the optimal number of parameters.


Planning behavior in a recurrent neural network that plays Sokoban

http://arxiv.org/abs/2407.15421v1

Compressor summary: The paper studies how the recurrent neural network of Guez et al. (2019), trained with reinforcement learning to play Sokoban, plans, finding that adding extra computation steps at test time improves its performance and reveals planning behavior.


Local All-Pair Correspondence for Point Tracking

http://arxiv.org/abs/2407.15420v1

Compressor summary: LocoTrack is a fast and accurate model for tracking any point across video sequences, using novel local 4D correlation and a lightweight encoder to overcome matching ambiguities.


LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

http://arxiv.org/abs/2407.15415v1

Compressor summary: LLaST is a framework that uses large language models to improve speech-to-text translation systems by optimizing model architecture and using various techniques like multilingual data augmentation.


Weights Shuffling for Improving DPSGD in Transformer-based Models

http://arxiv.org/abs/2407.15414v1

Compressor summary: This paper proposes a shuffling mechanism for Differentially Private Stochastic Gradient Descent (DPSGD) that enhances model utility without compromising privacy, using permutation invariance and an approximation on sum of lognormal distributions to analyze its performance.
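
For reference, the DPSGD baseline that the proposed shuffling modifies is per-example gradient clipping plus calibrated Gaussian noise; a minimal (naively looped) PyTorch sketch of that baseline, not of the paper's shuffling mechanism:

```python
import torch

def dpsgd_step(model, loss_fn, batch, clip=1.0, sigma=1.0, lr=0.1):
    """One standard DP-SGD step: clip each per-example gradient, sum, add noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in batch:  # per-example gradients, looped for clarity
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        coef = min(1.0, clip / (float(norm) + 1e-12))  # clip to norm <= clip
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=coef)
    with torch.no_grad():
        for s, p in zip(summed, params):
            s.add_(torch.randn_like(s), alpha=sigma * clip)  # calibrated noise
            p.add_(s, alpha=-lr / len(batch))
```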


Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

http://arxiv.org/abs/2407.15408v1

Compressor summary: The authors propose a method (CAR) to evaluate and improve the temporal alignment of language and 3D human motion representations in motion-language latent spaces.


Knowledge Mechanisms in Large Language Models: A Survey and Perspective

http://arxiv.org/abs/2407.15017v1

Compressor summary: This paper analyzes how large language models use, create, and change their knowledge, and identifies the challenges and opportunities for creating trustworthy artificial general intelligence.


Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

http://arxiv.org/abs/2407.15399v1

Compressor summary: The text describes a new attack method on large language models that exploits human conversation strategies to extract harmful information by manipulating the nature of the provided responses.


Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation

http://arxiv.org/abs/2407.15396v1

Compressor summary: The paper proposes a novel framework for scene graph generation that considers the semantic diversity of predicates to improve unbiased predictions.


ALLaM: Large Language Models for Arabic and English

http://arxiv.org/abs/2407.15390v1

Compressor summary: ALLaM is a large Arabic language model that leverages knowledge transfer, vocabulary expansion, and human preferences to achieve state-of-the-art performance in various benchmarks.


The Development of a Comprehensive Spanish Dictionary for Phonetic and Lexical Tagging in Socio-phonetic Research (ESPADA)

http://arxiv.org/abs/2407.15375v1

Compressor summary: The paper introduces ESPADA, a comprehensive and flexible pronunciation dictionary for Spanish with over 628,000 entries and various annotations, to improve speech forced alignment and dialectal research in the Spanish language.


ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

http://arxiv.org/abs/2407.15374v1

Compressor summary: The paper presents a new English linguistic corpus from Twitter posts collected from news agencies and individuals, with annotations and visualizations for studying language patterns.


Sparse Prior Is Not All You Need: When Differential Directionality Meets Saliency Coherence for Infrared Small Target Detection

http://arxiv.org/abs/2407.15369v1

Compressor summary: The SDD framework uses directional characteristics and sparse constraints to improve infrared small target detection, outperforming ten existing methods.


Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias

http://arxiv.org/abs/2407.15366v1

Compressor summary: Perspective-taking prompting (PeT) is a new method that helps large language models reduce toxicity and bias by encouraging them to consider different human perspectives and self-correct their responses.


A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

http://arxiv.org/abs/2407.15362v1

Compressor summary: The text describes a new approach (mSTAR) that uses multimodal data from multiple sources, such as images and reports, to improve the performance of computational pathology models on various clinical tasks.


Dissecting Multiplication in Transformers: Insights into LLMs

http://arxiv.org/abs/2407.15360v1

Compressor summary: The paper analyzes why transformer-based models struggle with integer multiplication and proposes improvements that enhance their performance and interpretability, outperforming LLMs such as GPT-4.


UF-HOBI at "Discharge Me!": A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models

http://arxiv.org/abs/2407.15359v1

Compressor summary: The paper describes a hybrid method for generating discharge summaries using NER and GatorTronGPT, achieving 5th place in the "Discharge Me!" Challenge.


Attention Beats Linear for Fast Implicit Neural Representation Generation

http://arxiv.org/abs/2407.15355v1

Compressor summary: The paper proposes ANR, a novel method that combines localized attention and MLP for efficient and accurate implicit neural representation, achieving improved reconstruction results on four datasets.


Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

http://arxiv.org/abs/2407.15354v1

Compressor summary: The VectorFormer is a new camera-based 3D object detector that combines high-resolution vectors with low-resolution Bird's-Eye-View representations to improve 3D geometry detection efficiency and accuracy in multi-camera images.


Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA

http://arxiv.org/abs/2407.15353v1

Compressor summary: The paper proposes a customized retrieval augmented generation framework with domain-specific techniques for EDA tool documentation question-answering, and releases an evaluation benchmark ORD-QA.


MAVEN-Fact: A Large-scale Event Factuality Detection Dataset

http://arxiv.org/abs/2407.15352v1

Compressor summary: The authors introduce MAVEN-Fact, a large and high-quality event factuality detection dataset based on the MAVEN dataset, which helps improve understanding of textual events.


LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation

http://arxiv.org/abs/2407.15351v1

Compressor summary: The paper proposes using a Large Language Model as a Bayesian Inference module to improve Graph Neural Network interpretability and address learning bias in unsupervised models.


RoadPainter: Points Are Ideal Navigators for Topology transformER

http://arxiv.org/abs/2407.15349v1

Compressor summary: RoadPainter is a new method for accurately detecting lane centerlines in road scenes using multi-view images and improving topological reasoning with additional points and an optional SD map module.


Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models

http://arxiv.org/abs/2407.15346v1

Compressor summary: DKA is a framework that improves KVQA by disentangling knowledge acquisition and using LLM feedback to generate simple sub-questions for accurate answers.


Improving Minimum Bayes Risk Decoding with Multi-Prompt

http://arxiv.org/abs/2407.15343v1

Compressor summary: Multi-prompt decoding improves text generation by using many candidates from a prompt bank and selecting the best one with Minimum Bayes Risk decoding.
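
The MBR selection step itself is simple; a hedged sketch assuming some pairwise utility such as a text-similarity metric (the paper's contribution lies in pooling candidates from a prompt bank and in the choice of utility):

```python
def mbr_select(candidates, utility):
    """Pick the candidate with the highest expected utility against the rest.

    candidates: list of generated texts (here, pooled across many prompts).
    utility: callable scoring how well one output "agrees" with another,
    e.g. a text-similarity metric.
    """
    if len(candidates) < 2:
        return candidates[0]

    def expected_utility(c):
        others = [o for o in candidates if o is not c]
        return sum(utility(c, o) for o in others) / len(others)

    return max(candidates, key=expected_utility)
```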


ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning

http://arxiv.org/abs/2407.15341v1

Compressor summary: The proposed CFICL method enhances sentiment recognition and improves DimABSA prediction accuracy using in-context learning and similarity-based example selection.


ThermalNeRF: Thermal Radiance Fields

http://arxiv.org/abs/2407.15337v1

Compressor summary: The paper proposes a method to reconstruct 3D scenes from LWIR and RGB images using a multispectral radiance field, improving thermal super-resolution and object visibility.


Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

http://arxiv.org/abs/2407.15328v1

Compressor summary: The paper proposes a novel framework for diffusion models that reduces data memorization risk using iterative ensemble training and anti-gradient control, improving performance on four datasets.


Odyssey: Empowering Agents with Open-World Skills

http://arxiv.org/abs/2407.15325v1

Compressor summary: ODYSSEY is a new framework for Minecraft agents that uses a large language model and an open-world skill library to explore the game world and solve various tasks.


Open-CD: A Comprehensive Toolbox for Change Detection

http://arxiv.org/abs/2407.15317v1

Compressor summary: Open-CD is a comprehensive change detection toolbox with various methods, components, and analysis scripts that aims to facilitate research and collaboration in the field.


FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification

http://arxiv.org/abs/2407.15312v1

Compressor summary: The Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN) is a novel approach to histopathological image classification that mimics the multi-granular diagnostic method of pathologists and uses fuzzy logic to handle redundant information, improving accuracy and interpretability.


Fever Detection with Infrared Thermography: Enhancing Accuracy through Machine Learning Techniques

http://arxiv.org/abs/2407.15302v1

Compressor summary: The authors improved the accuracy and reliability of infrared thermometers by integrating machine learning algorithms with infrared thermography, which can help diagnose infectious diseases like COVID-19.