arxiv compressed, 2024-07-12

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-12 generated by the compressor, my personal LLM-based project.


MAVIS: Mathematical Visual Instruction Tuning

http://arxiv.org/abs/2407.08739v1

Compressor summary: MAVIS is a new paradigm to improve large language models' mathematical problem-solving skills in visual contexts by providing specialized data, vision encoders, and instruction tuning.


Video Diffusion Alignment via Reward Gradients

http://arxiv.org/abs/2407.08737v1

Compressor summary: The paper presents a method that adapts video diffusion models to specific tasks by backpropagating gradients from pre-trained reward models; the dense gradient information enables more efficient learning in complex search spaces than gradient-free approaches, in terms of both reward queries and computation.


Transformer Circuit Faithfulness Metrics are not Robust

http://arxiv.org/abs/2407.08734v1

Compressor summary: The text discusses challenges in measuring the performance of subgraphs (circuits) in neural networks, emphasizing the need for clarity and better methods in mechanistic interpretability work.


Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

http://arxiv.org/abs/2407.08733v1

Compressor summary: MATHCHECK is a tool that evaluates the mathematical reasoning ability of large language models across diverse tasks, providing a better reflection of their true intelligence.


BiEquiFormer: Bi-Equivariant Representations for Global Point Cloud Registration

http://arxiv.org/abs/2407.08729v1

Compressor summary: The paper proposes BiEquiformer, a bi-equivariant deep learning pipeline for global point cloud registration that fuses information from both point clouds using expressive layers, achieving superior performance in robust settings.


Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data

http://arxiv.org/abs/2407.08726v1

Compressor summary: MIA is a data engine that uses Mapillary and OpenStreetMap to create a scalable and diverse dataset for predicting bird's eye view maps from first-person view images, improving autonomous navigation performance.


MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces

http://arxiv.org/abs/2407.08725v1

Compressor summary: MetaUrban is a simulation platform for testing embodied AI in urban spaces, improving generalizability and safety of mobile agents.


Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms

http://arxiv.org/abs/2407.08723v1

Compressor summary: The paper introduces new topology-based complexity measures that correlate with generalization error in deep neural networks without assuming continuous-time training dynamics or restrictive geometric assumptions.


WhisperNetV2: SlowFast Siamese Network For Lip-Based Biometrics

http://arxiv.org/abs/2407.08717v1

Compressor summary: WhisperNetV2 is a novel deep learning network for lip-based biometric authentication that considers emotions and uses SlowFast networks to extract behavioral and physiological features, achieving state-of-the-art performance.


A Taxonomy for Data Contamination in Large Language Models

http://arxiv.org/abs/2407.08716v1

Compressor summary: The text discusses data contamination in large language models, its types, and its effect on downstream tasks like summarization and question answering.


Sensor-Aware Classifiers for Energy-Efficient Time Series Applications on IoT Devices

http://arxiv.org/abs/2407.08715v1

Compressor summary: The proposed approach uses early exit classifiers with partial sensor windows to minimize energy consumption while maintaining accuracy in time-series applications, enabling significant energy savings and allowing for remote use in limited energy situations.
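The early-exit idea can be sketched in a few lines: run increasingly large prefixes of the sensor window through a cascade of classifiers and stop as soon as one is confident. A minimal illustration (the prefix schedule, threshold, and stand-in classifiers are mine, not the paper's):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify_with_early_exit(window, classifiers, threshold=0.9):
    """Feed growing prefixes of the sensor window to a classifier cascade;
    exit as soon as one is confident enough.
    Returns (label, fraction_of_window_consumed)."""
    n = len(window)
    fractions = [0.25, 0.5, 1.0]  # hypothetical partial-window schedule
    for frac, clf in zip(fractions, classifiers):
        probs = softmax(clf(window[: int(n * frac)]))
        if probs.max() >= threshold or frac == 1.0:
            return int(probs.argmax()), frac

# Stand-in classifiers: score class 0 by the prefix mean (illustration only).
clfs = [lambda x: np.array([x.mean() * 10, 0.0])] * 3
label, used = classify_with_early_exit(np.ones(100), clfs)
print(label, used)
```

An easy input exits after 25% of the window, which is where the energy savings come from: the remaining sensor samples never need to be captured.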


GTA: A Benchmark for General Tool Agents

http://arxiv.org/abs/2407.08713v1

Compressor summary: The authors introduce GTA, a benchmark that evaluates LLMs' tool-use abilities in realistic scenarios with real user queries, real deployed tools, and real multimodal inputs, and show that existing LLMs perform poorly on these tasks, revealing bottlenecks in their tool-use capabilities that current evaluations fail to expose.


OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

http://arxiv.org/abs/2407.08711v1

Compressor summary: OmniNOCS is a large dataset with diverse object classes and annotations that enables training of a novel 3D detection model with shape and segmentation information, which achieves comparable results to state-of-the-art methods.


Extracting Training Data from Document-Based VQA Models

http://arxiv.org/abs/2407.08707v1

Compressor summary: This paper studies how vision-language models can memorize and regurgitate sensitive information from training documents, and proposes a method to prevent this privacy risk.


HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

http://arxiv.org/abs/2407.08706v1

Compressor summary: HiRes-LLaVA is a framework that efficiently processes high-resolution inputs without losing contextual and geometric information by using a SliceRestore adapter and a Self-Mining Sampler.


Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

http://arxiv.org/abs/2407.08701v1

Compressor summary: Live2Diff is a new video diffusion model with uni-directional temporal attention for live streaming video translation that ensures temporal consistency and smoothness and can process videos at interactive framerates.


Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

http://arxiv.org/abs/2407.08699v1

Compressor summary: The text proposes a new method called Branch-and-Merge (BaM) for adapting large language models to different languages, which reduces forgetting of the source domain and maintains or improves target domain performance.


Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers

http://arxiv.org/abs/2407.08689v1

Compressor summary: This text discusses the need for regulatory frameworks to ensure trustworthy and ethical AI tools, and provides an accessible overview of existing literature on operationalizing these principles, highlighting gaps and trade-offs between guidelines and current AI research.


SEED-Story: Multimodal Long Story Generation with Large Language Model

http://arxiv.org/abs/2407.08683v1

Compressor summary: The proposed method SEED-Story generates extended multimodal stories using a Multimodal Large Language Model, predicting both text and visual tokens, and employing an efficient autoregressive mechanism with a new dataset called StoryStream.


Generalizable Implicit Motion Modeling for Video Frame Interpolation

http://arxiv.org/abs/2407.08680v1

Compressor summary: GIMM is a novel approach to motion modeling for VFI that uses spatiotemporal motion latent from bidirectional flows and implicit prediction of optical flows via an adaptive neural network, achieving better performance than existing methods.


CAD-Prompted Generative Models: A Pathway to Feasible and Novel Engineering Designs

http://arxiv.org/abs/2407.08675v1

Compressor summary: The paper proposes a method to improve the feasibility of designs generated by text-to-image models by using CAD images as prompts, and evaluates it on bike design examples with Stable Diffusion 2.1.


Still-Moving: Customized Video Generation without Customized Video Data

http://arxiv.org/abs/2407.08674v1

Compressor summary: Still-Moving is a framework to customize text-to-video models without needing customized video data, using spatial adapters trained on frozen videos and a motion adapter module.


NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning

http://arxiv.org/abs/2407.08672v1

Compressor summary: The paper proposes NODE-Adapter, a novel method using Neural ODEs for better vision-language reasoning by constructing and optimizing cross-modal prototypes for downstream tasks.


Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images

http://arxiv.org/abs/2407.08669v1

Compressor summary: The paper proposes an attention mechanism guided by image segmentation for visual question answering in remote sensing, and introduces a new dataset with high-resolution images and questions/answers.


Uncertainty Estimation of Large Language Models in Medical Question Answering

http://arxiv.org/abs/2407.08662v1

Compressor summary: The paper proposes a new uncertainty estimation method for natural language generation in healthcare using large language models, which generates explanations and verification questions to detect inconsistencies and measure uncertainty.


Controlling the Fidelity and Diversity of Deep Generative Models via Pseudo Density

http://arxiv.org/abs/2407.08659v1

Compressor summary: The text introduces a new way to control the quality and variety of data generated by deep models using a metric called pseudo density, which allows adjustments for individual samples and different techniques for enhancing fidelity or diversity.


Confidence-based Estimators for Predictive Performance in Model Monitoring

http://arxiv.org/abs/2407.08649v1

Compressor summary: The paper explores a simple method for estimating model accuracy when ground truth labels are unavailable, compares it with other methods, and shows that it often outperforms them.
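The simplest estimator of this kind is average confidence: with no labels available, estimate accuracy as the mean of the model's maximum softmax probability over the monitoring batch. A minimal sketch (not necessarily the paper's exact formulation; it is reliable only to the extent the model is calibrated on the shifted data):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def estimate_accuracy(logits):
    """Label-free accuracy estimate: mean max softmax probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    return probs.max(axis=-1).mean()

# Two confident predictions and one near-coin-flip drag the estimate down.
logits = np.array([[4.0, 0.0], [3.0, 0.5], [0.2, 0.1]])
print(estimate_accuracy(logits))
```

For a well-calibrated model this quantity tracks true accuracy; under miscalibration it inherits the model's over- or under-confidence, which is exactly the failure mode such comparisons probe.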


Towards Building Specialized Generalist AI with System 1 and System 2 Fusion

http://arxiv.org/abs/2407.08642v1

Compressor summary: The paper introduces Specialized Generalist Artificial Intelligence (SGAI), a milestone toward Artificial General Intelligence, which combines human-level expertise in specific tasks with general abilities, and proposes a framework for developing it.


How more data can hurt: Instability and regularization in next-generation reservoir computing

http://arxiv.org/abs/2407.08641v1

Compressor summary: Increasing data can harm dynamics learning in reservoir computing due to instability from delayed states, but using regularization or noise can help mitigate it.
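The mechanism can be illustrated with the linear readout of next-generation reservoir computing, where features are delayed copies of the state: a delay embedding of a near-periodic signal is nearly rank-deficient, so the unregularized least-squares readout blows up, and a small ridge penalty tames it. A toy sketch under my own variable names (linear delay features only, no polynomial terms):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar time series: a sinusoid plus small noise. Its 3-tap delay
# embedding is close to rank 2, which is what destabilizes the fit.
x = np.sin(0.3 * np.arange(200)) + 0.01 * rng.standard_normal(200)

k = 3  # number of delay taps
X = np.column_stack([x[i:len(x) - k + i] for i in range(k)])  # delayed states
y = x[k:]                                                     # one-step target

def ridge_readout(X, y, lam):
    # Closed-form ridge regression: w = (X^T X + lam I)^(-1) X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_unreg = ridge_readout(X, y, 0.0)    # ill-conditioned, weights can blow up
w_ridge = ridge_readout(X, y, 1e-2)   # regularization shrinks the solution

print(np.linalg.norm(w_unreg), np.linalg.norm(w_ridge))
```

The ridge solution's norm is non-increasing in the penalty, which is the stabilizing effect the paper's regularization (or added noise) provides.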


Modality Agnostic Heterogeneous Face Recognition with Switch Style Modulators

http://arxiv.org/abs/2407.08640v1

Compressor summary: The paper introduces a novel framework that trains a modality-agnostic face recognition model using an automatic routing mechanism, enabling it to handle multiple modalities without explicit target label knowledge.


$β$-DPO: Direct Preference Optimization with Dynamic $β$

http://arxiv.org/abs/2407.08639v1

Compressor summary: The paper introduces a dynamic method to adjust the trade-off parameter in Direct Preference Optimization, which improves the alignment of large language models with human feedback based on data quality.
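For context, β in DPO scales the implicit reward margin inside the loss, and the paper's point is that a fixed β is suboptimal across data of varying quality. A hedged sketch of the standard per-pair DPO loss plus one *plausible* batch-level adjustment rule (the rule below is my illustration, not the paper's exact calibration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * implicit reward margin)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def dynamic_beta(margins, beta0=0.1, sensitivity=0.5):
    # Hypothetical rule: shrink beta when the batch's implicit reward
    # margins are large (easy, possibly noisy pairs), keep it near beta0
    # when they are small, so the loss stays informative per batch.
    mean_margin = sum(margins) / len(margins)
    return beta0 / (1.0 + sensitivity * abs(mean_margin))

margins = [2.0, 1.5, 0.5]
beta = dynamic_beta(margins)
loss = dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.2, ref_logp_l=-1.8,
                beta=beta)
print(beta, loss)
```

The key point the sketch conveys is structural: β enters the loss as a multiplier on the margin, so adjusting it per batch directly controls how sharply the model is pushed toward the preferred response.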


RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

http://arxiv.org/abs/2407.08634v1

Compressor summary: The text introduces RTMW, a series of high-performance models for 2D/3D whole-body pose estimation that capture pose information from different body parts with various scales and achieve strong performance on multiple benchmarks.


A Novel Framework for Automated Warehouse Layout Generation

http://arxiv.org/abs/2407.08633v1

Compressor summary: The AI-driven framework generates optimal warehouse layouts based on spatial constraints, functional requirements, and accessibility criteria.


Generalization Error Matters in Decentralized Learning Under Byzantine Attacks

http://arxiv.org/abs/2407.08632v1

Compressor summary: The paper analyzes the generalization errors in Byzantine-resilient decentralized learning algorithms and shows that they cannot be fully eliminated due to malicious agents.


RoboMorph: Evolving Robot Morphology using Large Language Models

http://arxiv.org/abs/2407.08626v1

Compressor summary: RoboMorph is an automated method that uses large language models and evolutionary algorithms to generate and optimize modular robot designs efficiently.


Surpassing Cosine Similarity for Multidimensional Comparisons: Dimension Insensitive Euclidean Metric (DIEM)

http://arxiv.org/abs/2407.08623v1

Compressor summary: The paper analyzes the limitations of common metrics for high-dimensional comparisons and introduces a new dimension insensitive metric, DIEM, which overcomes these limitations for better interpretability and accuracy.
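The paper's exact DIEM formula is not reproduced here, but the idea of a dimension-insensitive distance can be sketched in its spirit: standardize the raw Euclidean distance by the empirical distribution of distances between random vector pairs of the same dimension, so values are comparable across dimensionalities. A rough illustration (my own construction, labeled as such):

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_euclidean(a, b, n_ref=2000):
    """Center and scale the Euclidean distance by the distribution of
    distances between random reference pairs of the same dimension.
    Illustration only; not the paper's exact DIEM definition."""
    d = len(a)
    ref = np.linalg.norm(rng.uniform(-1, 1, (n_ref, d)) -
                         rng.uniform(-1, 1, (n_ref, d)), axis=1)
    return (np.linalg.norm(np.asarray(a) - np.asarray(b)) - ref.mean()) / ref.std()

# Identical vectors score well below the random baseline in any dimension,
# whereas the raw Euclidean baseline itself grows with sqrt(dimension).
for dim in (3, 300):
    v = rng.uniform(-1, 1, dim)
    print(dim, standardized_euclidean(v, v) < 0)
```

Raw Euclidean distance between random points grows roughly with the square root of the dimension, which is the sensitivity such a normalization removes.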


Tamil Language Computing: the Present and the Future

http://arxiv.org/abs/2407.08618v1

Compressor summary: The paper explores the text processing aspects of Language Computing, discussing advancements in deep learning, computational resources, linguistic annotation, and practical applications for languages like Tamil, emphasizing the need for more research collaboration and digitization.


FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

http://arxiv.org/abs/2407.08608v1

Compressor summary: The paper proposes FlashAttention-3, which improves the speed and accuracy of attention in large language models on Hopper GPUs using three techniques: warp-specialization, interleaved matmul and softmax operations, and block quantization with incoherent processing.


Turn-Level Empathy Prediction Using Psychological Indicators

http://arxiv.org/abs/2407.08607v1

Compressor summary: The paper presents a new empathy detection method that uses six psychological indicators and improves empathy prediction with a large language model and fine-tuning, ranking 7th in a shared task.


A Review of Nine Physics Engines for Reinforcement Learning Research

http://arxiv.org/abs/2407.08590v1

Compressor summary: The text reviews nine popular simulation frameworks for reinforcement learning (RL) research, comparing them based on various criteria and highlighting their strengths and weaknesses.


The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

http://arxiv.org/abs/2407.08583v1

Compressor summary: The paper reviews existing work on multi-modal large language models (MLLMs) from a data-model co-development perspective, arguing that data is crucial to building LLMs and MLLMs and that data and models develop each other across applications.


On the Universal Truthfulness Hyperplane Inside LLMs

http://arxiv.org/abs/2407.08582v1

Compressor summary: The paper explores if a universal truthfulness hyperplane can be found within LLMs to distinguish factual correct and incorrect outputs across different tasks and domains using diverse datasets.


Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

http://arxiv.org/abs/2407.08572v1

Compressor summary: The paper proposes a new post-train Dual Bayesian strategy to improve adversarial transferability in skeleton-based human activity recognition models by smoothing the loss landscape and crafting adversarial examples along motion dynamics.


Multi-Group Proportional Representation

http://arxiv.org/abs/2407.08571v1

Compressor summary: The text proposes Multi-Group Proportional Representation (MPR), a new metric to measure and improve representation of intersectional groups in image search and retrieval tasks, addressing the limitations of existing methods that only consider single or binary attributes.


Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

http://arxiv.org/abs/2407.08569v1

Compressor summary: LiSe is a new method that uses LiDAR data and 2D images for unsupervised 3D object detection, improving performance by adaptively sampling and aggregating weak models.


Adaptive Parametric Activation

http://arxiv.org/abs/2407.08567v1

Compressor summary: The paper introduces APA, a versatile activation function that adapts to the data distribution and improves performance in both balanced and imbalanced tasks.


The Career Interests of Large Language Models

http://arxiv.org/abs/2407.08564v1

Compressor summary: This study explores how large language models' career interests and competencies vary with language changes and model advancements, revealing their human-like tendencies and potential implications for integrating them into professional environments.


Vox Populi, Vox AI? Using Language Models to Estimate German Public Opinion

http://arxiv.org/abs/2407.08563v1

Compressor summary: This study finds that GPT-3.5, a large language model, does not accurately predict vote choice in Germany based on synthetic samples generated from survey data, highlighting the limitations of using LLMs for public opinion estimation.


MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

http://arxiv.org/abs/2407.08561v1

Compressor summary: MapLocNet, a transformer-based neural re-localization method, improves autonomous driving by registering navigation maps to visual features, providing accurate and fast localization without HD maps.


ST-Mamba: Spatial-Temporal Mamba for Traffic Flow Estimation Recovery using Limited Data

http://arxiv.org/abs/2407.08558v1

Compressor summary: ST-Mamba, a deep learning model that integrates CNNs with the Mamba framework, improves traffic flow estimation (TFE) accuracy and stability using minimal cloud-sourced vehicular network data, offering a cost-effective alternative to traditional methods that are expensive and limited in coverage.


Establishing Rigorous and Cost-effective Clinical Trials for Artificial Intelligence Models

http://arxiv.org/abs/2407.08554v1

Compressor summary: The text discusses the gap between AI and clinical practice in medicine, suggesting new evaluation methods that involve patients and clinicians to improve AI's impact on healthcare.


Autoregressive Speech Synthesis without Vector Quantization

http://arxiv.org/abs/2407.08551v1

Compressor summary: MELLE is a new TTS method that generates mel-spectrograms directly from text using continuous tokens and regression loss, improving fidelity and diversity over discrete codec models.


Incorporating Large Language Models into Production Systems for Enhanced Task Automation and Flexibility

http://arxiv.org/abs/2407.08550v1

Compressor summary: The paper presents a method to integrate large language models into automated production systems, enhancing task automation and flexibility by using digital twins and microservices.


Quantitative Evaluation of the Saliency Map for Alzheimer's Disease Classifier with Anatomical Segmentation

http://arxiv.org/abs/2407.08546v1

Compressor summary: The paper proposes a new evaluation metric, VCS, that correlates saliency maps of deep learning classifiers for Alzheimer's disease with brain volume changes across different regions, improving the understanding of the model's decision process.


Exemplar-free Continual Representation Learning via Learnable Drift Compensation

http://arxiv.org/abs/2407.08536v1

Compressor summary: The paper introduces Learnable Drift Compensation (LDC), a method to mitigate semantic drift in prototype-based continual learning and achieve state-of-the-art performance in supervised and semi-supervised settings.


BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight

http://arxiv.org/abs/2407.08526v1

Compressor summary: The paper proposes BLOS-BEV, a model that combines onboard camera visual information and SD maps to extend the perception range of autonomous vehicles to 200 meters for better scene understanding and planning.


Emergent Visual-Semantic Hierarchies in Image-Text Representations

http://arxiv.org/abs/2407.08521v1

Compressor summary: The authors propose the Radial Embedding framework and the HierarCaps dataset to probe and improve hierarchical visual-semantic understanding in foundation models like CLIP, showing that these models have emergent knowledge of visual-semantic hierarchies, outperform prior purpose-built models in zero-shot hierarchical understanding, and align further with hierarchical reasoning after text-only fine-tuning.


Generalized Low-Rank Matrix Completion Model with Overlapping Group Error Representation

http://arxiv.org/abs/2407.08517v1

Compressor summary: The paper proposes a generalized low-rank matrix completion model using overlapping group error representation to better capture global and local structure information of real data.


Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

http://arxiv.org/abs/2407.08516v1

Compressor summary: The article discusses how connectionist and symbolic AI are converging in large language models like ChatGPT, which enable autonomous agents with enhanced reasoning and decision-making capabilities compared to knowledge graphs.


15M Multimodal Facial Image-Text Dataset

http://arxiv.org/abs/2407.08515v1

Compressor summary: The paper introduces FaceCaption-15M, a large dataset of facial images with natural language descriptions, to facilitate research on face-centered tasks and achieve state-of-the-art results on two challenging tasks using FLIP-based models.


Fine-Tuning Stable Diffusion XL for Stylistic Icon Generation: A Comparison of Caption Size

http://arxiv.org/abs/2407.08513v1

Compressor summary: The paper discusses different methods to fine-tune Stable Diffusion XL for generating high-quality 2D icons and emphasizes the importance of defining "high-quality" in a commercial setting, as well as the limitations of FID and CLIP scores for evaluating icon quality.


Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

http://arxiv.org/abs/2407.08507v1

Compressor summary: The paper proposes a novel self-supervised framework using vision-language models to estimate remote physiological signals from facial videos, improving performance over existing methods.


Latent Conditional Diffusion-based Data Augmentation for Continuous-Time Dynamic Graph Mode

http://arxiv.org/abs/2407.08500v1

Compressor summary: Conda is a new method that improves dynamic graph learning by generating better embeddings for target nodes using latent diffusion and Variational Auto-Encoder techniques.


ERD: Exponential Retinex decomposition based on weak space and hybrid nonconvex regularization and its denoising application

http://arxiv.org/abs/2407.08498v1

Compressor summary: The paper proposes a new Retinex model for image denoising that uses non-convex regularization and weak space oscillation, and demonstrates its effectiveness in removing noise from images.


CE-QArg: Counterfactual Explanations for Quantitative Bipolar Argumentation Frameworks (Technical Report)

http://arxiv.org/abs/2407.08497v1

Compressor summary: The paper introduces a method to explain and change the strength of arguments in Quantitative Bipolar Argumentation Frameworks using counterfactual explanations.


Investigating LLMs as Voting Assistants via Contextual Augmentation: A Case Study on the European Parliament Elections 2024

http://arxiv.org/abs/2407.08495v1

Compressor summary: The authors investigate if large language models can be used as voting advice applications in the European Parliament elections and explore ways to improve their performance, finding that MIXTRAL is highly accurate and expert-curated information boosts accuracy by 9%.


Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

http://arxiv.org/abs/2407.08489v1

Compressor summary: The paper introduces a new way to detect objects with orientation, using points and axes, that improves performance and avoids some common problems.


Lynx: An Open Source Hallucination Evaluation Model

http://arxiv.org/abs/2407.08488v1

Compressor summary: LYNX is a state-of-the-art hallucination detection model that can reason well in real-world scenarios and outperforms other models on the new HaluBench benchmark.


Learning Localization of Body and Finger Animation Skeleton Joints on Three-Dimensional Models of Human Bodies

http://arxiv.org/abs/2407.08484v1

Compressor summary: The paper presents a deep learning method that localizes body and finger animation skeleton joints on 3D human body models using synthetic training samples and input points with normal vectors, outperforming the state of the art with a simpler architecture and faster processing times.


Robust Generalization of Graph Neural Networks for Carrier Scheduling

http://arxiv.org/abs/2407.08479v1

Compressor summary: RobustGANTT is a GNN-based scheduler that efficiently and adaptively computes carrier schedules for battery-free sensor tags in large-scale IoT networks.


VideoMamba: Spatio-Temporal Selective State Space Model

http://arxiv.org/abs/2407.08476v1

Compressor summary: VideoMamba is an efficient and effective video recognition model that uses Mamba's linear complexity and selective SSM mechanism to capture spatial and temporal information in videos.


Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective

http://arxiv.org/abs/2407.08475v1

Compressor summary: This paper reviews public fine-tuning datasets for large models, focusing on their construction techniques and methods to provide a comprehensive overview and guide future research.


Brain Tumor Segmentation in MRI Images with 3D U-Net and Contextual Transformer

http://arxiv.org/abs/2407.08470v1

Compressor summary: The paper proposes a 3D-UNet model with a Context Transformer to segment brain tumors in MRI scans, achieving high accuracy compared to existing methods.


TLDR: Unsupervised Goal-Conditioned RL via Temporal Distance-Aware Representations

http://arxiv.org/abs/2407.08464v1

Compressor summary: The text introduces a new unsupervised GCRL method called TLDR, which uses temporal distance to guide exploration and goal-reaching in complex robotic environments.


Semi-Supervised Object Detection: A Survey on Progress from CNN to Transformer

http://arxiv.org/abs/2407.08460v1

Compressor summary: The paper reviews 27 state-of-the-art semi-supervised object detection methods that use a mix of labeled and unlabeled data to improve performance and reduce dependence on expensive labeled datasets.


Joint Optimization of Age of Information and Energy Consumption in NR-V2X System based on Deep Reinforcement Learning

http://arxiv.org/abs/2407.08458v1

Compressor summary: The paper proposes a DRL-based optimization method for reducing energy consumption and AoI in NR-V2X communication by employing interference cancellation with NOMA technology.


Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending

http://arxiv.org/abs/2407.08457v1

Compressor summary: The Neural Poisson Solver is a framework for blending visual signals represented by Implicit Neural Representations (INRs), using a continuous variational problem-solving approach to achieve natural, distortion-free results.


Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

http://arxiv.org/abs/2407.08454v1

Compressor summary: The paper proposes KVMerger, a novel technique to compress KV cache for large language models in long-context scenarios without losing information or degrading performance under constrained memory budgets.
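The general shape of KV cache merging can be sketched without the paper's specifics: walk the cached key/value pairs and fold each entry into its predecessor when the keys are nearly parallel. A toy sketch (greedy policy, cosine threshold, and running-mean merge are my illustration, not KVMerger's actual algorithm):

```python
import numpy as np

def merge_kv_cache(keys, values, sim_threshold=0.95):
    """Greedily merge each KV pair into the previous kept entry when their
    keys are nearly parallel. Illustration only, not KVMerger's policy."""
    merged_k, merged_v, counts = [keys[0].copy()], [values[0].copy()], [1]
    for k, v in zip(keys[1:], values[1:]):
        prev = merged_k[-1]
        cos = prev @ k / (np.linalg.norm(prev) * np.linalg.norm(k) + 1e-12)
        if cos >= sim_threshold:
            # Running mean keeps the merged entry representative of members.
            c = counts[-1]
            merged_k[-1] = (merged_k[-1] * c + k) / (c + 1)
            merged_v[-1] = (merged_v[-1] * c + v) / (c + 1)
            counts[-1] = c + 1
        else:
            merged_k.append(k.copy())
            merged_v.append(v.copy())
            counts.append(1)
    return np.array(merged_k), np.array(merged_v)

keys = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
vals = np.array([[1.0], [3.0], [5.0]])
k2, v2 = merge_kv_cache(keys, vals)
print(len(k2))  # the two near-parallel keys collapse into one entry
```

The memory saving is the ratio of merged to original entries; the open question such methods address is how to merge without degrading long-context attention quality.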


Paving the way toward foundation models for irregular and unaligned Satellite Image Time Series

http://arxiv.org/abs/2407.08448v1

Compressor summary: ALISE is a novel method that produces aligned latent representations for satellite remote sensing imagery using spatial, spectral, and temporal dimensions, improving performance in crop segmentation and change detection tasks.


WildGaussians: 3D Gaussian Splatting in the Wild

http://arxiv.org/abs/2407.08447v1

Compressor summary: WildGaussians is a novel method that combines robust features, appearance modeling, and 3D Gaussian Splatting to handle occlusions and appearance changes in 3D scene reconstruction, achieving state-of-the-art results with real-time rendering speed.


Infinite Motion: Extended Motion Generation via Long Text Instructions

http://arxiv.org/abs/2407.08443v1

Compressor summary: The paper introduces "Infinite Motion", a novel approach for creating long and high-quality motion sequences from text, with applications in editing and splicing.


How Deep is your Guess? A Fresh Perspective on Deep Learning for Medical Time-Series Imputation

http://arxiv.org/abs/2407.08442v1

Compressor summary: The paper presents a new framework for classifying time-series imputation methods using deep learning, especially for clinical data, and discusses how to choose the best approach depending on the data properties and missingness scenarios.


Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

http://arxiv.org/abs/2407.08441v1

Compressor summary: This study examines the biases in large language models and explores how prompt engineering can reveal them, highlighting the need for better mitigation techniques for a fairer AI.


Beyond Instruction Following: Evaluating Rule Following of Large Language Models

http://arxiv.org/abs/2407.08440v1

Compressor summary: The paper clarifies the concept of rule-following, creates RuleBench benchmark to evaluate LLMs' rule-following abilities, and finds that current LLMs are limited in following rules.


Improve Load Forecasting in Energy Communities through Transfer Learning using Open-Access Synthetic Profiles

http://arxiv.org/abs/2407.08434v1

Compressor summary: Using synthetic load profiles and transfer learning techniques improves the forecast accuracy of power consumption in the absence of historical data.


Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy

http://arxiv.org/abs/2407.08432v1

Compressor summary: The text describes a new algorithm (SG-RCPS) that improves uncertainty quantification in radiotherapy planning using magnetic resonance-guided linear accelerators, by providing prediction intervals for multiple subgroups with unknown membership at test time.


A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights

http://arxiv.org/abs/2407.08428v1

Compressor summary: The text surveys recent progress in human video generation using generative models, discussing methods for text-driven, audio-driven, and pose-driven motion generation, and reviewing datasets, evaluation metrics, and future research directions.


PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines

http://arxiv.org/abs/2407.08418v1

Compressor summary: PredBench is a benchmark for evaluating spatio-temporal prediction networks with diverse datasets and methods, offering comprehensive insights into their performance.


Unveiling the Potential of BERTopic for Multilingual Fake News Analysis -- Use Case: Covid-19

http://arxiv.org/abs/2407.08417v1

Compressor summary: The study applies BERTopic, a topic modeling method that requires careful hyperparameter tuning, to multilingual Covid-19 fake news, revealing thematic similarities across countries.


Parallelizing Autoregressive Generation with Variational State Space Models

http://arxiv.org/abs/2407.08415v1

Compressor summary: The variational SSM combines variational autoencoders and state space models to enable parallel training and generation for autoregressive sequence modeling.


MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

http://arxiv.org/abs/2407.08414v1

Compressor summary: The paper proposes a new method to create realistic human avatars from multi-view videos using physics-based rendering and neural networks, overcoming the limitations of existing methods based on neural radiance fields.


CLEO: Continual Learning of Evolving Ontologies

http://arxiv.org/abs/2407.08411v1

Compressor summary: CLEO is a new continual learning setting in which intelligent systems must adapt to evolving ontologies, a scenario that existing methods struggle to handle.


Specialist vision-language models for clinical ophthalmology

http://arxiv.org/abs/2407.08410v1

Compressor summary: This paper shows how a specially trained AI model can outperform existing models and junior ophthalmologists in writing reports about age-related macular degeneration, demonstrating the potential of AI for medical diagnosis and care.


Self-training Language Models for Arithmetic Reasoning

http://arxiv.org/abs/2407.08400v1

Compressor summary: The authors explore how language models can improve arithmetic reasoning without new data by self-training on automated feedback about their own predictions, with online self-training outperforming supervised methods.
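The self-training loop this summary describes can be sketched generically: sample candidate answers, keep only those that pass an automated check, and reuse the survivors as training data. The toy "model" below is a stand-in (an assumption, not the authors' system) that answers arithmetic questions noisily.

```python
import random

def sample_solutions(question, n=5):
    """Toy stand-in for sampling answers from a language model:
    returns noisy candidate answers for an arithmetic question."""
    truth = eval(question)  # questions are trusted arithmetic strings here
    return [truth + random.choice([-1, 0, 0, 1]) for _ in range(n)]

def self_train_round(questions):
    """One self-training round: keep only question/answer pairs whose
    answer passes the automated check, then reuse them as training data."""
    accepted = []
    for q in questions:
        for ans in sample_solutions(q):
            if ans == eval(q):          # automated feedback on the prediction
                accepted.append((q, ans))
                break                   # one verified sample per question
    return accepted

random.seed(0)
new_data = self_train_round(["2+3", "7*6", "10-4"])
```

In a real pipeline the verified pairs would be fed back into fine-tuning, closing the loop.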


Using deep neural networks to detect non-analytically defined expert event labels in canoe sprint force sensor signals

http://arxiv.org/abs/2407.08395v1

Compressor summary: The paper investigates using neural networks and a modified metric to automatically detect paddle strokes in canoe sprint training sessions.


Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

http://arxiv.org/abs/2407.08394v1

Compressor summary: Diff-Tracker uses a pre-trained text-to-image diffusion model to learn a prompt for unsupervised visual tracking and updates it online as the target moves.


On the attribution of confidence to large language models

http://arxiv.org/abs/2407.08388v1

Compressor summary: The paper discusses the theoretical basis of assigning confidence levels to Large Language Models (LLMs) and argues that their existence as belief systems is plausible but uncertain, while their attribution may be inaccurate due to experimental limitations.


Digital twins to alleviate the need for real field data in vision-based vehicle speed detection systems

http://arxiv.org/abs/2407.08380v1

Compressor summary: The authors propose using digital twins with CARLA simulator to generate a large dataset for accurate vision-based speed estimation, and achieve a low mean absolute error of under 3 km/h.


Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework

http://arxiv.org/abs/2407.08377v1

Compressor summary: The paper introduces a new dataset and method for handling atmospheric turbulence in long-range imaging, using dynamic and static priors to mitigate severe distortions effectively.


Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Cross-Regularization

http://arxiv.org/abs/2407.08374v1

Compressor summary: The paper introduces OrthCR, a method to finetune vision-language models like CLIP for specific tasks by injecting orthogonal matrices into the transformer architecture and using cross-regularization to enhance robustness, stability, and generalization.


Scalar Function Topology Divergence: Comparing Topology of 3D Objects

http://arxiv.org/abs/2407.08364v1

Compressor summary: SFTD is a new tool for comparing multi-scale topology of scalar functions defined on graphs or Euclidean spaces, improving applications in 3D computer vision such as cellular shape reconstruction and error detection.


STAL: Spike Threshold Adaptive Learning Encoder for Classification of Pain-Related Biosignal Data

http://arxiv.org/abs/2407.08362v1

Compressor summary: The paper proposes a new spiking neural network method for chronic lower back pain classification using an adaptive encoder and an ensemble of recurrent neural networks, achieving superior performance over traditional methods.


Event-based vision on FPGAs -- a survey

http://arxiv.org/abs/2407.08356v1

Compressor summary: The paper reviews FPGA-based solutions for processing event camera data in various applications, highlighting their advantages in energy efficiency and real-time performance.


AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models

http://arxiv.org/abs/2407.08351v1

Compressor summary: The paper proposes AutoBencher, a tool that creates novel and difficult benchmarks for language models by searching for datasets satisfying salience, novelty, and difficulty criteria.


Spine Vision X-Ray Image based GUI Planning of Pedicle Screws Using Enhanced YOLOv5 for Vertebrae Segmentation

http://arxiv.org/abs/2407.08349v1

Compressor summary: The paper presents a GUI for precise spinal screw placement using vertebrae segmentation from X-Ray images, improving preoperative planning and intra-operative guidance.


Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

http://arxiv.org/abs/2407.08348v1

Compressor summary: The paper explores how data scaling affects mathematical reasoning in large language models and introduces the Skywork-Math series, which outperforms GPT-4 on certain benchmarks.


Adaptive Deep Iris Feature Extractor at Arbitrary Resolutions

http://arxiv.org/abs/2407.08341v1

Compressor summary: The paper proposes a resolution-adaptive deep feature extractor for iris recognition that automatically switches among resolution expert modules specialized for different degradations, training the lower-resolution experts by knowledge distillation from the high-resolution expert to keep recognition robust at arbitrary resolutions.


SLRL: Structured Latent Representation Learning for Multi-view Clustering

http://arxiv.org/abs/2407.08340v1

Compressor summary: SLRL is a novel framework for Multi-View Clustering that leverages both complementary and structural information among samples to improve clustering outcomes.


SR-Mamba: Effective Surgical Phase Recognition with State Space Model

http://arxiv.org/abs/2407.08333v1

Compressor summary: SR-Mamba is a novel attention-free surgical phase recognition model whose bidirectional Mamba decoder captures long-distance temporal relationships in overlong sequences, enabling single-step neural network training and achieving state-of-the-art performance on two datasets.


HDT: Hierarchical Document Transformer

http://arxiv.org/abs/2407.08330v1

Compressor summary: The Hierarchical Document Transformer (HDT) is a new sparse Transformer architecture that uses document structure to improve efficiency and performance for tasks like science, law, or medicine.


Unveiling Disparities in Maternity Care: A Topic Modelling Approach to Analysing Maternity Incident Investigation Reports

http://arxiv.org/abs/2407.08328v1

Compressor summary: The study uses natural language processing to analyse maternity care incidents, revealing disparities among ethnic groups and stressing the importance of data analysis for improving quality and equity.


A Cantor-Kantorovich Metric Between Markov Decision Processes with Application to Transfer Learning

http://arxiv.org/abs/2407.08324v1

Compressor summary: The paper introduces and applies a new metric for measuring similarity between Markov decision processes, which can help improve transfer learning in reinforcement learning.


Intelligent Multi-Document Summarisation for Extracting Insights on Racial Inequalities from Maternity Incident Investigation Reports

http://arxiv.org/abs/2407.08322v1

Compressor summary: The paper introduces I-SIRch:CS, a framework that uses AI to analyse safety incident reports in healthcare, identify patterns, and improve safety by learning from past incidents.


Improving Molecular Modeling with Geometric GNNs: an Empirical Study

http://arxiv.org/abs/2407.08313v1

Compressor summary: The paper studies how different aspects of Geometric Graph Neural Networks affect the accuracy, efficiency, and symmetry preservation of 3D atomic systems for molecular modeling.


DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

http://arxiv.org/abs/2407.08303v1

Compressor summary: Perceptual Fusion is a method to create a large image-text dataset by fusing diverse perception experts and an efficient multimodal language model, improving comprehensive visual perception for existing models.


Impact Measures for Gradual Argumentation Semantics

http://arxiv.org/abs/2407.08302v1

Compressor summary: The paper presents two refined and new impact measures for argumentation, compares them with gradual semantics, and analyzes their performance.


XAI-Guided Enhancement of Vegetation Indices for Crop Mapping

http://arxiv.org/abs/2407.08298v1

Compressor summary: The authors propose an AI-based method to select and design vegetation indices for monitoring crop growth using multispectral satellite data and deep neural networks.


Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

http://arxiv.org/abs/2407.08296v1

Compressor summary: Q-GaLore is a novel method that combines quantization and low-rank projection to reduce memory usage when training large language models without compromising performance.
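The two ingredients the summary names can be sketched in isolation: symmetric INT4 fake-quantization and an SVD-based low-rank gradient projection. This is a minimal illustration of those building blocks under simple assumptions, not the actual Q-GaLore algorithm (which adds layer-adaptive and stochastic-rounding machinery).

```python
import numpy as np

def quantize_int4(x):
    """Symmetric fake-quantization to the INT4 range [-8, 7]."""
    scale = float(np.abs(x).max()) / 7.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

def low_rank_project(grad, rank):
    """Keep only the top-`rank` singular directions of a gradient matrix,
    the memory-saving idea behind GaLore-style low-rank training."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :rank]            # projection basis (m x rank)
    return p, p.T @ grad       # compact (rank x n) gradient representation

rng = np.random.default_rng(0)
g = rng.normal(size=(64, 32))            # a full-size weight gradient
p, g_small = low_rank_project(g, rank=4)
q, scale = quantize_int4(p)              # store the projector in INT4
g_approx = dequantize(q, scale) @ g_small
```

Only the small `(rank x n)` gradient and the INT4 projector need to live in memory between steps, which is where the savings come from.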


Gap Completion in Point Cloud Scene occluded by Vehicles using SGC-Net

http://arxiv.org/abs/2407.08290v1

Compressor summary: The study proposes a novel approach using deep neural networks to fill gaps in urban 3D data caused by vehicle occlusions, generating realistic point cloud scenes.


Predicting Heart Failure with Attention Learning Techniques Utilizing Cardiovascular Data

http://arxiv.org/abs/2407.08289v1

Compressor summary: The paper proposes an attention learning-based method to predict heart failure using EHR data and different optimizers with varying learning rates, achieving better results than existing methods like LSTM.


WayveScenes101: A Dataset and Benchmark for Novel View Synthesis in Autonomous Driving

http://arxiv.org/abs/2407.08280v1

Compressor summary: WayveScenes101 is a dataset with 101 real-world driving scenes that challenge scene reconstruction methods with varying environmental and traffic conditions, and includes camera poses and metadata for evaluation.


Continually Learn to Map Visual Concepts to Large Language Models in Resource-constrained Environments

http://arxiv.org/abs/2407.08279v1

Compressor summary: Continual Visual Mapping (CVM) continually trains a small visual model to map into a fixed large language model's representation space, grounding vision in the model's rich knowledge to learn robust, generalizable representations from non-i.i.d. data, outperforming other methods on five benchmarks even on resource-constrained devices.


StixelNExT: Toward Monocular Low-Weight Perception for Object Segmentation and Free Space Detection

http://arxiv.org/abs/2407.08277v1

Compressor summary: The authors propose a new method for object segmentation using monocular images without manual annotations, leveraging the concept of Stixel-World to recognize multiple objects.


Explainability of Sub-Field Level Crop Yield Prediction using Remote Sensing

http://arxiv.org/abs/2407.08274v1

Compressor summary: The study develops and explains deep learning models for predicting crop yields using satellite images and other data, and shows how different temporal samplings and input modalities affect the models' performance and reliability.


RB-SQL: A Retrieval-based LLM Framework for Text-to-SQL

http://arxiv.org/abs/2407.08273v1

Compressor summary: The authors propose RB-SQL, a new framework that uses retrieval to improve LLMs' text-to-SQL performance by providing relevant schema and examples for in-context learning.


PowerYOLO: Mixed Precision Model for Hardware Efficient Object Detection with Event Data

http://arxiv.org/abs/2407.08272v1

Compressor summary: PowerYOLO is an energy-efficient object detection system that uses a novel sensor and mixed precision quantisation to achieve high accuracy while reducing memory and computational complexity.


Gaussian process interpolation with conformal prediction: methods and comparative analysis

http://arxiv.org/abs/2407.08271v1

Compressor summary: The article proposes conformal prediction methods for Gaussian process interpolation to improve calibration and uncertainty quantification without sacrificing accuracy.
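Split conformal prediction is the generic recipe behind such calibration methods: hold out data, compute absolute residuals, and widen predictions by a finite-sample quantile. The sketch below uses a fixed smooth function as a stand-in for a GP posterior mean (an assumption; the article compares several concrete variants).

```python
import numpy as np

def split_conformal(predict, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction: calibrate absolute residuals on held-out
    data, then emit intervals with >= 1-alpha marginal coverage."""
    resid = np.sort(np.abs(y_cal - predict(X_cal)))
    n = len(resid)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)  # finite-sample quantile
    q = resid[k - 1]
    mu = predict(X_test)
    return mu - q, mu + q

rng = np.random.default_rng(1)
mean_fn = np.sin                       # stand-in for a GP posterior mean
X_cal = rng.uniform(0.0, 6.0, 200)
y_cal = mean_fn(X_cal) + rng.normal(0.0, 0.1, 200)

X_new = rng.uniform(0.0, 6.0, 1000)
y_new = mean_fn(X_new) + rng.normal(0.0, 0.1, 1000)
lo, hi = split_conformal(mean_fn, X_cal, y_cal, X_new)
coverage = np.mean((y_new >= lo) & (y_new <= hi))
```

The coverage guarantee holds marginally for any predictor, which is why conformal wrappers can recalibrate a GP without sacrificing its accuracy.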


LLMs' morphological analyses of complex FST-generated Finnish words

http://arxiv.org/abs/2407.08269v1

Compressor summary: The text evaluates the morphological analysis abilities of various language models, finding that GPT-4 is somewhat challenged while smaller models perform poorly.


Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

http://arxiv.org/abs/2407.08268v1

Compressor summary: CLIPtrase is a new method that improves semantic segmentation using CLIP by enhancing patch feature correlations without additional training, achieving state-of-the-art results.


Enhancing Thermal Infrared Tracking with Natural Language Modeling and Coordinate Sequence Generation

http://arxiv.org/abs/2407.08265v1

Compressor summary: The paper proposes NLMTrack, a novel natural language modeling-based approach for thermal infrared (TIR) tracking that enhances temporal and spatial information and outperforms previous methods on multiple benchmarks.


SALSA: Swift Adaptive Lightweight Self-Attention for Enhanced LiDAR Place Recognition

http://arxiv.org/abs/2407.08260v1

Compressor summary: SALSA is a new framework that uses radial window attention, self-attention, and a Mixer layer to perform efficient and accurate LiDAR place recognition for large-scale mapping and localization tasks.


Knowledge distillation to effectively attain both region-of-interest and global semantics from an image where multiple objects appear

http://arxiv.org/abs/2407.08257v1

Compressor summary: The paper proposes RveRNet, a novel architecture for classifying food images by segmenting the food region, using DeiTs for classification, and integrating both local and global contexts.


GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification

http://arxiv.org/abs/2407.08255v1

Compressor summary: GraphMamba is a new framework for hyperspectral image classification that combines efficient graph structure learning with linear spectral encoding to extract spatial-spectral features, outperforming existing methods on real datasets.


Gradient Boosting Reinforcement Learning

http://arxiv.org/abs/2407.08250v1

Compressor summary: The text introduces Gradient-Boosting RL (GBRL), a framework that extends the benefits of Gradient Boosting Trees (GBT) to reinforcement learning, improving interpretability and performance in domains with structured or categorical features.


Toward accessible comics for blind and low vision readers

http://arxiv.org/abs/2407.08248v1

Compressor summary: The authors propose a method to generate accurate text descriptions of comic stories using language models and computer vision techniques, which can then be converted into audiobooks and eBooks.


Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching

http://arxiv.org/abs/2407.08244v1

Compressor summary: The paper proposes a new regularization method for 3D shape matching using synchronous diffusion to achieve smooth correspondences and improve performance.


Generalized Face Anti-spoofing via Finer Domain Partition and Disentangling Liveness-irrelevant Factors

http://arxiv.org/abs/2407.08243v1

Compressor summary: The text describes a new face anti-spoofing technique that focuses on learning identity-invariant liveness representations and handling style shifts to achieve state-of-the-art performance across different datasets and scenarios.


Differentially Private Neural Network Training under Hidden State Assumption

http://arxiv.org/abs/2407.08233v1

Compressor summary: DP-SBCD is a new method for training private neural networks with provable guarantees, which handles non-convex problems and uses calibrated adaptive noise.


SwishReLU: A Unified Approach to Activation Functions for Enhanced Deep Neural Networks Performance

http://arxiv.org/abs/2407.08232v1

Compressor summary: SwishReLU is a new activation function that combines ReLU and Swish, improving performance over ReLU with lower computational cost than Swish, especially for image datasets.
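For reference, ReLU is max(0, x) and Swish is x·sigmoid(x). The hybrid below is a plausible sketch only: the exact SwishReLU formula is the paper's and may differ, so the combination shown here (identity on the positive side, Swish's smooth tail below zero) is explicitly an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    return x * sigmoid(x)

def swish_relu(x):
    """Hypothetical hybrid (assumption -- the paper's exact formula may
    differ): identity on the positive side like ReLU, Swish's smooth
    non-monotone tail below zero, saving Swish's sigmoid on positive inputs."""
    return np.where(x >= 0.0, x, swish(x))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
ys = swish_relu(xs)
```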


E2VIDiff: Perceptual Events-to-Video Reconstruction using Diffusion Priors

http://arxiv.org/abs/2407.08231v1

Compressor summary: Event cameras capture brightness changes with high temporal precision, but converting their signals into videos is challenging; diffusion priors improve the quality and realism of the reconstructed videos.


DALL-M: Context-Aware Clinical Data Augmentation with LLMs

http://arxiv.org/abs/2407.08227v1

Compressor summary: The text introduces DALL-M, a novel technique that uses large language models to generate patient contextual synthetic data for chest X-ray images, improving AI medical diagnostics by incorporating clinical tabular data and enhancing the model performance.


Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

http://arxiv.org/abs/2407.08223v1

Compressor summary: Speculative RAG combines a generalist and specialist LLM to quickly generate and verify diverse, accurate, and updated responses using multiple subsets of retrieved documents.


GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views

http://arxiv.org/abs/2407.08221v1

Compressor summary: GAURA is a generalizable neural rendering method that can synthesize high-quality 3D scenes from imperfect images under various degradations without test-time optimization.


Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People

http://arxiv.org/abs/2407.08219v1

Compressor summary: The paper explores how large language models can generate contextually-relevant navigational instructions for blind and low-vision individuals in various scenarios, using a new dataset of images and goals.


Enhancing Performance and User Engagement in Everyday Stress Monitoring: A Context-Aware Active Reinforcement Learning Approach

http://arxiv.org/abs/2407.08215v1

Compressor summary: The paper presents a novel algorithm that uses active reinforcement learning and contextual data to improve stress detection efficiency and accuracy using PPG data from smartwatches.


Towards stable training of parallel continual learning

http://arxiv.org/abs/2407.08214v1

Compressor summary: The paper introduces a novel approach called Stable Parallel Continual Learning (SPCL) to enhance the training stability of PCL for both forward and backward propagation, using orthogonal techniques to manage gradients and optimize network parameters.


Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets

http://arxiv.org/abs/2407.08209v1

Compressor summary: This paper proposes a novel method to generate more informative and consistent synthetic data for curvilinear object segmentation, using textual features and a ControlNet with spatial adaptation.


System Report for CCL24-Eval Task 7: Multi-Error Modeling and Fluency-Targeted Pre-training for Chinese Essay Evaluation

http://arxiv.org/abs/2407.08206v1

Compressor summary: The authors describe their approaches and results for the CEFE task at CCL-2024, achieving first place by using back-translation, NSP-based strategy, and Symmetric Cross Entropy loss.


Chromosomal Structural Abnormality Diagnosis by Homologous Similarity

http://arxiv.org/abs/2407.08204v1

Compressor summary: The paper proposes a data-driven method that aligns homologous chromosomes and diagnoses structural abnormalities using their similarity, improving upon existing methods.


Deep Understanding of Soccer Match Videos

http://arxiv.org/abs/2407.08200v1

Compressor summary: Our system uses computer vision to analyze live soccer matches, generating highlights, graphics, and insights for viewers.


SRPose: Two-view Relative Pose Estimation with Sparse Keypoints

http://arxiv.org/abs/2407.08199v1

Compressor summary: SRPose is a sparse keypoint-based method for estimating relative camera or object pose transformation from two RGB images, achieving competitive performance and generalizability.


SoupLM: Model Integration in Large Language and Multi-Modal Models

http://arxiv.org/abs/2407.08196v1

Compressor summary: The study introduces SoupLM, a cost-efficient multimodal LLM assembled from different LLM variants with diverse training recipes, tasks, and data modalities.


A Text-to-Game Engine for UGC-Based Role-Playing Games

http://arxiv.org/abs/2407.08195v1

Compressor summary: The paper presents a text-to-game engine that uses foundation models to generate interactive RPGs from simple text inputs, demonstrating the potential of generative AI for transforming the game industry.


ARCO: Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design

http://arxiv.org/abs/2407.08192v1

Compressor summary: ARCO is a co-optimizing compilation framework that uses MARL to improve the efficiency and performance of mapping machine learning models onto diverse hardware platforms.


fairBERTs: Erasing Sensitive Information Through Semantic and Fairness-aware Perturbations

http://arxiv.org/abs/2407.08189v1

Compressor summary: fairBERTs is a framework that uses a generative adversarial network to remove bias from pre-trained language models while preserving their utility for natural language processing tasks.


Position: Measure Dataset Diversity, Don't Just Claim It

http://arxiv.org/abs/2407.08188v1

Compressor summary: The text discusses how the lack of clear definitions for terms like "diversity" in machine learning datasets affects their creation and suggests ways to improve this process using social science principles.


ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation

http://arxiv.org/abs/2407.08187v1

Compressor summary: The authors propose ScaleDepth, a novel monocular depth estimation method that decomposes metric depth into scene scale and relative depth, achieving state-of-the-art performance across various scenes.


Beyond Text: Leveraging Multi-Task Learning and Cognitive Appraisal Theory for Post-Purchase Intention Analysis

http://arxiv.org/abs/2407.08182v1

Compressor summary: This study evaluates using users' self-expression and traits in multi-task learning frameworks to improve predictions of their behavior based on NLP, finding that it enhances understanding and prediction.


CoGS: Causality Constrained Counterfactual Explanations using goal-directed ASP

http://arxiv.org/abs/2407.08179v1

Compressor summary: The CoGS framework uses s(CASP) to generate realistic and causally consistent counterfactual explanations for rule-based machine learning models, helping users understand decisions and changes in input attributes that lead to desired outcomes.


Faster Machine Unlearning via Natural Gradient Descent

http://arxiv.org/abs/2407.08169v1

Compressor summary: The authors propose a new algorithm using Natural Gradient Descent for efficiently deleting data from trained machine learning models while maintaining strong privacy guarantees and improving generalization performance.
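To make the idea of curvature-aware deletion concrete, here is a minimal sketch on ridge regression, where the loss is quadratic and a single second-order step recovers exact retraining. This illustrates the mechanism only; the paper's natural-gradient algorithm targets general neural networks with privacy guarantees.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def unlearn_second_order(theta, X, X_del, y_del, lam=1.0):
    """One curvature-aware update that deletes (X_del, y_del) from a ridge
    model; because the loss is quadratic, the step matches exact retraining."""
    d = X.shape[1]
    H_new = X.T @ X - X_del.T @ X_del + lam * np.eye(d)  # Hessian w/o deleted rows
    grad_del = X_del.T @ (X_del @ theta - y_del)         # gradient of deleted points
    return theta + np.linalg.solve(H_new, grad_del)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(0.0, 0.1, 100)

theta = ridge_fit(X, y)
theta_unlearned = unlearn_second_order(theta, X, X[:10], y[:10])
theta_retrained = ridge_fit(X[10:], y[10:])
```

For non-quadratic losses the step is only approximate, which is where the curvature metric (natural gradient vs. plain gradient) starts to matter.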


Synthetic Electroretinogram Signal Generation Using Conditional Generative Adversarial Network for Enhancing Classification of Autism Spectrum Disorder

http://arxiv.org/abs/2407.08166v1

Compressor summary: The text describes how artificial intelligence can be used with electroretinogram tests to study and classify neurodevelopmental disorders like autism spectrum disorder.


Hierarchical Consensus-Based Multi-Agent Reinforcement Learning for Multi-Robot Cooperation Tasks

http://arxiv.org/abs/2407.08164v1

Compressor summary: The paper proposes HC-MARL, a framework that uses contrastive learning to promote global consensus among agents without direct communication, enabling cooperative behavior in MARL tasks with dynamic requirements.


Improving Visual Place Recognition Based Robot Navigation Through Verification of Localization Estimates

http://arxiv.org/abs/2407.08162v1

Compressor summary: The proposed system improves visual place recognition for robot navigation by using a multi-layer perceptron integrity monitor that reduces errors and increases success rates in real-world experiments.


AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

http://arxiv.org/abs/2407.08156v1

Compressor summary: The study introduces Image Address Localization (IAL) problem for social media and photojournalism, proposing AddressCLIP, an end-to-end framework using contrastive learning and manifold learning to predict readable addresses from images, and creating three datasets for the task.


Lifelong Histopathology Whole Slide Image Retrieval via Distance Consistency Rehearsal

http://arxiv.org/abs/2407.08153v1

Compressor summary: The paper proposes a framework for lifelong whole slide image retrieval that addresses catastrophic forgetting by updating models on growing databases and maintaining distance consistency.


Enrich the content of the image Using Context-Aware Copy Paste

http://arxiv.org/abs/2407.08151v1

Compressor summary: Our method uses BLIP to extract content from source images, matches it with category information, and integrates it with SAM or YOLO for consistent Copy-Paste data augmentation in computer vision tasks.


Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

http://arxiv.org/abs/2407.08150v1

Compressor summary: The authors introduce a new dataset (SRI-ADV) with multi-modal data from different users watching advertisement videos, and propose a model (HMLLM) to analyze cognitive understanding of video content across demographics.


Deep Polarization Cues for Single-shot Shape and Subsurface Scattering Estimation

http://arxiv.org/abs/2407.08149v1

Compressor summary: The paper presents a new method to estimate the shape and subsurface scattering parameters of translucent objects using polarization cues and introduces a large-scale synthetic dataset for training.


SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning

http://arxiv.org/abs/2407.08148v1

Compressor summary: SCPNet is a novel unsupervised framework for homography estimation that combines intra-modal self-supervised learning, correlation-based estimation, and consistent feature map projection, achieving state-of-the-art performance on various datasets.


Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication

http://arxiv.org/abs/2407.08147v1

Compressor summary: The paper introduces a new dataset and models for studying and classifying reduplication and repetition in Hindi, Telugu, and Marathi speech using computational linguistics.


Survey on Fundamental Deep Learning 3D Reconstruction Techniques

http://arxiv.org/abs/2407.08137v1

Compressor summary: The survey explores different deep learning methods for creating realistic 3D scenes and discusses their pros and cons.


EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions

http://arxiv.org/abs/2407.08136v1

Compressor summary: EchoMimic is a novel approach that generates portrait videos using audio input and facial landmarks, overcoming the limitations of previous methods and outperforming alternatives in various evaluations.


Highway Networks for Improved Surface Reconstruction: The Role of Residuals and Weight Updates

http://arxiv.org/abs/2407.08134v1

Compressor summary: The paper presents a new neural network called Square-Highway that improves surface reconstruction from point clouds over existing methods, with applications in computer graphics and medical imaging.


Nonverbal Interaction Detection

http://arxiv.org/abs/2407.08133v1

Compressor summary: This paper presents NVI, a large-scale annotated dataset for nonverbal interaction detection, and proposes NVI-DEHR, a hypergraph model that captures high-order interactions for better interpretation of nonverbal signals.


DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing

http://arxiv.org/abs/2407.08132v1

Compressor summary: The DMM model uses disparity information, multi-scale target-aware attention, and target-prior aware auxiliary task to improve multispectral oriented object detection while reducing computational complexity.


Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment

http://arxiv.org/abs/2407.08127v1

Compressor summary: The paper proposes a novel black-box model inversion attack method that uses a Prediction Alignment Encoder to map the target model's output into StyleGAN latent space, improving attack accuracy and reducing query numbers significantly.


Label-anticipated Event Disentanglement for Audio-Visual Video Parsing

http://arxiv.org/abs/2407.08126v1

Compressor summary: The paper introduces LEAP, a decoding method that uses label texts to disentangle overlapping events in audio-visual parsing, along with a semantic similarity loss function to improve performance and interpretability.


Real-Time Summarization of Twitter

http://arxiv.org/abs/2407.08125v1

Compressor summary: The paper presents a real-time summarization system for Twitter that classifies tweets against interest profiles using a Dirichlet score, achieving good performance as measured by MAP, CG, and DCG.


Improving Dental Diagnostics: Enhanced Convolution with Spatial Attention Mechanism

http://arxiv.org/abs/2407.08114v1

Compressor summary: The paper introduces a deep learning model that improves dental image analysis by combining ResNet50 with SimAM attention module, achieving superior performance and accuracy compared to traditional architectures.


FYI: Flip Your Images for Dataset Distillation

http://arxiv.org/abs/2407.08113v1

Compressor summary: The paper shows that incorporating horizontal flips into dataset distillation counters bilateral equivalence, where synthesized images collapse into left-right symmetry and degrade object recognition.


How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities

http://arxiv.org/abs/2407.08112v1

Compressor summary: The text discusses the challenges of modelling long sequences with deep neural networks, evaluates recent advances in this area, and highlights the need for more research on the extrapolation capabilities of different inductive biases.


Advanced Meta-Ensemble Machine Learning Models for Early and Accurate Sepsis Prediction to Improve Patient Outcomes

http://arxiv.org/abs/2407.08107v1

Compressor summary: The paper proposes using machine learning techniques to predict sepsis onset, and evaluates them using various metrics, finding the meta-ensemble model to be the most accurate.


Automata-based constraints for language model decoding

http://arxiv.org/abs/2407.08103v1

Compressor summary: The authors propose an automata-based method to constrain language model decoding to regular languages, which cover many practical output formats, while remaining efficient and easy to implement.
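The summary names the technique only at a high level; a toy, character-level sketch of DFA-constrained greedy decoding (the DFA, the language `(ab)+`, and the scores here are hypothetical, and a real decoder would mask token logits rather than filter an alphabet):

```python
# Constrain generation to the regular language (ab)+ with a hand-built DFA.
DFA = {
    # state -> {symbol: next_state}
    0: {"a": 1},
    1: {"b": 2},
    2: {"a": 1},
}
ACCEPTING = {2}

def allowed_symbols(state):
    """Symbols the DFA permits from the current state."""
    return set(DFA.get(state, {}))

def constrained_generate(scores, max_len=6):
    """Greedy decode: pick the highest-scoring symbol the DFA allows."""
    state, out = 0, []
    for _ in range(max_len):
        candidates = allowed_symbols(state)
        if not candidates:
            break
        sym = max(candidates, key=lambda s: scores.get(s, 0.0))
        out.append(sym)
        state = DFA[state][sym]
    return "".join(out), state in ACCEPTING

text, ok = constrained_generate({"a": 0.9, "b": 0.8})
print(text, ok)  # ababab True
```

The same masking idea scales to subword vocabularies by walking the automaton over each candidate token's characters.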


Live Fitness Coaching as a Testbed for Situated Interaction

http://arxiv.org/abs/2407.08101v1

Compressor summary: The QEVD benchmark and dataset explore how vision-language models can handle open-ended, asynchronous fitness coaching interactions by recognizing actions, identifying mistakes, and providing feedback.


Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

http://arxiv.org/abs/2407.08100v1

Compressor summary: The paper proves that popular adaptive optimizers such as Adam fail to converge to any (possibly random) limit point when learning rates remain bounded away from zero, despite working well in practice.
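For context, the optimizer under study is standard Adam; a minimal scalar sketch of its update rule, run with a constant (non-vanishing) learning rate as in the paper's setting (hyperparameters are the usual defaults, chosen here only for illustration):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update on a scalar parameter."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 with a constant learning rate: the iterate
# approaches the minimum but keeps oscillating rather than settling.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(abs(theta) < 1.0)  # True
```

With a constant step size the normalized update stays roughly `lr` in magnitude, which is exactly the non-vanishing regime the paper's non-convergence result addresses.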


How does Burrows' Delta work on medieval Chinese poetic texts?

http://arxiv.org/abs/2407.08099v1

Compressor summary: The study shows that the Burrows' Delta method effectively attributes authorship in medieval Chinese poetry, grouping texts by the same poet and separating them from those of other poets.
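Burrows' Delta itself is a standard stylometric measure; a self-contained sketch on a toy corpus (the frequency values are invented for illustration): each word's relative frequency is z-scored across the corpus, and Delta between two texts is the mean absolute difference of their z-score profiles.

```python
import numpy as np

def burrows_delta(freqs):
    """Pairwise Burrows' Delta from a (texts x words) relative-frequency matrix."""
    mu = freqs.mean(axis=0)
    sd = freqs.std(axis=0)          # population std over the corpus
    z = (freqs - mu) / sd           # z-score each word column
    n = len(freqs)
    return np.array([[np.mean(np.abs(z[i] - z[j])) for j in range(n)]
                     for i in range(n)])

# Toy corpus: rows are texts, columns are top function-word frequencies.
freqs = np.array([
    [0.050, 0.030, 0.010],   # poet A, text 1
    [0.048, 0.031, 0.012],   # poet A, text 2
    [0.020, 0.060, 0.040],   # poet B
])
delta = burrows_delta(freqs)
print(delta[0, 1] < delta[0, 2])  # True: same-poet texts are closer
```

Attribution then amounts to assigning a disputed text to the candidate author with the smallest Delta, which is the comparison the study runs on the Chinese poetic corpus.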