This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-12, generated by the compressor, my personal LLM-based project.
http://arxiv.org/abs/2407.08739v1
Compressor summary: MAVIS is a new paradigm to improve large language models' mathematical problem-solving skills in visual contexts by providing specialized data, vision encoders, and instruction tuning.
http://arxiv.org/abs/2407.08737v1
Compressor summary:
Key points:
- The paper proposes using pre-trained reward models to adapt video diffusion models for specific tasks
- The reward models contain dense gradient information that helps learn efficiently in complex search spaces
- The approach outperforms gradient-free methods in terms of reward queries and computation
Summary: The paper presents a method to improve video diffusion models by using pre-trained reward models with rich gradients, achieving more efficient learning than gradient-free alternatives.
http://arxiv.org/abs/2407.08734v1
Compressor summary: The text discusses challenges in measuring the performance of subgraphs (circuits) in neural networks, emphasizing the need for clarity and better methods in mechanistic interpretability work.
http://arxiv.org/abs/2407.08733v1
Compressor summary: MATHCHECK is a tool that evaluates the mathematical reasoning ability of large language models across diverse tasks, providing a better reflection of their true intelligence.
http://arxiv.org/abs/2407.08729v1
Compressor summary: The paper proposes BiEquiformer, a bi-equivariant deep learning pipeline for global point cloud registration that fuses information from both point clouds using expressive layers, achieving superior performance in robust settings.
http://arxiv.org/abs/2407.08726v1
Compressor summary: MIA is a data engine that uses Mapillary and OpenStreetMap to create a scalable and diverse dataset for predicting bird's eye view maps from first-person view images, improving autonomous navigation performance.
http://arxiv.org/abs/2407.08725v1
Compressor summary: MetaUrban is a simulation platform for testing embodied AI in urban spaces, improving generalizability and safety of mobile agents.
http://arxiv.org/abs/2407.08723v1
Compressor summary: The paper introduces new topology-based complexity measures that correlate with generalization error in deep neural networks without assuming continuous-time training dynamics or restrictive geometric assumptions.
http://arxiv.org/abs/2407.08717v1
Compressor summary: WhisperNetV2 is a novel deep learning network for lip-based biometric authentication that considers emotions and uses SlowFast networks to extract behavioral and physiological features, achieving state-of-the-art performance.
http://arxiv.org/abs/2407.08716v1
Compressor summary: The text discusses data contamination in large language models, its types, and its effect on downstream tasks like summarization and question answering.
http://arxiv.org/abs/2407.08715v1
Compressor summary: The proposed approach uses early exit classifiers with partial sensor windows to minimize energy consumption while maintaining accuracy in time-series applications, enabling significant energy savings and allowing for remote use in limited energy situations.
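As a generic illustration of the early-exit idea this summary describes (not the paper's actual architecture), a cascade of classifiers can inspect growing prefixes of a sensor window and stop sensing as soon as one stage is confident; all names, fractions, and thresholds below are hypothetical:

```python
import numpy as np

def early_exit_predict(window, classifiers, prefix_fractions, threshold=0.9):
    """Early-exit inference sketch: try classifiers trained on growing
    prefixes of the sensor window; return as soon as one is confident.
    `classifiers` maps a prefix fraction to a callable returning class
    probabilities (hypothetical interface, not the paper's API)."""
    for frac in prefix_fractions:
        n = int(len(window) * frac)
        probs = classifiers[frac](window[:n])  # inference on a partial window
        if probs.max() >= threshold:
            return int(probs.argmax()), frac   # early exit: skip remaining sensing
    # no exit fired: fall back to the largest window tried
    return int(probs.argmax()), prefix_fractions[-1]
```

Exiting after half a window means the remaining sensor samples never need to be collected, which is where the energy saving comes from.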
http://arxiv.org/abs/2407.08713v1
Compressor summary:
Key points:
- The text proposes GTA, a benchmark for General Tool Agents that simulates real-world scenarios with real user queries, real deployed tools, and real multimodal inputs.
- Current evaluations of LLMs' tool-use capabilities are not effective in revealing their problem-solving abilities.
- Existing LLMs perform poorly on GTA tasks, indicating bottlenecks in their tool-use capabilities.
Summary: The authors introduce GTA, a realistic benchmark for evaluating LLMs' ability to use tools in various scenarios, and show that current LLMs struggle with many real-world tasks.
http://arxiv.org/abs/2407.08711v1
Compressor summary: OmniNOCS is a large dataset with diverse object classes and annotations that enables training of a novel 3D detection model with shape and segmentation information, which achieves comparable results to state-of-the-art methods.
http://arxiv.org/abs/2407.08707v1
Compressor summary: This paper studies how vision-language models can memorize and regurgitate sensitive information from training documents, and proposes a method to prevent this privacy risk.
http://arxiv.org/abs/2407.08706v1
Compressor summary: HiRes-LLaVA is a framework that efficiently processes high-resolution inputs without losing contextual and geometric information by using a SliceRestore adapter and a Self-Mining Sampler.
http://arxiv.org/abs/2407.08701v1
Compressor summary: Live2Diff is a new video diffusion model with uni-directional temporal attention for live streaming video translation that ensures temporal consistency and smoothness and can process videos at interactive framerates.
http://arxiv.org/abs/2407.08699v1
Compressor summary: The text proposes a new method called Branch-and-Merge (BaM) for adapting large language models to different languages, which reduces forgetting of the source domain and maintains or improves target domain performance.
http://arxiv.org/abs/2407.08689v1
Compressor summary: This text discusses the need for regulatory frameworks to ensure trustworthy and ethical AI tools, and provides an accessible overview of existing literature on operationalizing these principles, highlighting gaps and trade-offs between guidelines and current AI research.
http://arxiv.org/abs/2407.08683v1
Compressor summary: The proposed method SEED-Story generates extended multimodal stories using a Multimodal Large Language Model, predicting both text and visual tokens, and employing an efficient autoregressive mechanism with a new dataset called StoryStream.
http://arxiv.org/abs/2407.08680v1
Compressor summary: GIMM is a novel approach to motion modeling for VFI that uses spatiotemporal motion latent from bidirectional flows and implicit prediction of optical flows via an adaptive neural network, achieving better performance than existing methods.
http://arxiv.org/abs/2407.08675v1
Compressor summary: The paper proposes a method to improve the feasibility of designs generated by text-to-image models by using CAD images as prompts, and evaluates it on bike design examples with Stable Diffusion 2.1.
http://arxiv.org/abs/2407.08674v1
Compressor summary: Still-Moving is a framework to customize text-to-video models without needing customized video data, using spatial adapters trained on frozen videos and a motion adapter module.
http://arxiv.org/abs/2407.08672v1
Compressor summary: The paper proposes NODE-Adapter, a novel method using Neural ODEs for better vision-language reasoning by constructing and optimizing cross-modal prototypes for downstream tasks.
http://arxiv.org/abs/2407.08669v1
Compressor summary: The paper proposes an attention mechanism guided by image segmentation for visual question answering in remote sensing, and introduces a new dataset with high-resolution images and questions/answers.
http://arxiv.org/abs/2407.08662v1
Compressor summary: The paper proposes a new uncertainty estimation method for natural language generation in healthcare using large language models, which generates explanations and verification questions to detect inconsistencies and measure uncertainty.
http://arxiv.org/abs/2407.08659v1
Compressor summary: The text introduces a new way to control the quality and variety of data generated by deep models using a metric called pseudo density, which allows adjustments for individual samples and different techniques for enhancing fidelity or diversity.
http://arxiv.org/abs/2407.08649v1
Compressor summary: The paper explores a simple method for estimating model accuracy when ground truth labels are unavailable, compares it with other methods, and shows that it often outperforms them.
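One classic "simple method" for label-free accuracy estimation is average confidence: the mean of the model's max softmax probability over the unlabeled set. The paper's exact method may differ; this is a minimal sketch of that baseline:

```python
import numpy as np

def estimate_accuracy(probs):
    """Average-confidence accuracy estimate on unlabeled data: the mean
    of the max predicted probability per example. Works well when the
    model is reasonably calibrated; overestimates under overconfidence."""
    return float(np.mean(probs.max(axis=1)))
```

For example, predictions with confidences 0.9 and 0.6 yield an estimated accuracy of 0.75.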
http://arxiv.org/abs/2407.08642v1
Compressor summary: The paper introduces Specialized Generalist Artificial Intelligence (SGAI), a milestone toward Artificial General Intelligence, which combines human-level expertise in specific tasks with general abilities, and proposes a framework for developing it.
http://arxiv.org/abs/2407.08641v1
Compressor summary: Increasing data can harm dynamics learning in reservoir computing due to instability from delayed states, but using regularization or noise can help mitigate it.
http://arxiv.org/abs/2407.08640v1
Compressor summary: The paper introduces a novel framework that trains a modality-agnostic face recognition model using automatic routing mechanism, enabling it to handle multiple modalities without explicit target label knowledge.
http://arxiv.org/abs/2407.08639v1
Compressor summary: The paper introduces a dynamic method to adjust the trade-off parameter in Direct Preference Optimization, which improves the alignment of large language models with human feedback based on data quality.
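The standard DPO loss that the paper's dynamic scheme modifies can be written down directly; in this sketch `beta` is the trade-off parameter, and a dynamic variant would set it per sample (e.g., from an estimate of pair quality) rather than keeping it fixed:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    `beta` controls the implicit KL trade-off; the paper adjusts it
    dynamically per sample (not shown here)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; a larger preferred-over-rejected margin drives it toward zero.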
http://arxiv.org/abs/2407.08634v1
Compressor summary: The text introduces RTMW, a series of high-performance models for 2D/3D whole-body pose estimation that capture pose information from different body parts with various scales and achieve strong performance on multiple benchmarks.
http://arxiv.org/abs/2407.08633v1
Compressor summary: The AI-driven framework generates optimal warehouse layouts based on spatial constraints, functional requirements, and accessibility criteria.
http://arxiv.org/abs/2407.08632v1
Compressor summary: The paper analyzes the generalization errors in Byzantine-resilient decentralized learning algorithms and shows that they cannot be fully eliminated due to malicious agents.
http://arxiv.org/abs/2407.08626v1
Compressor summary: RoboMorph is an automated method that uses large language models and evolutionary algorithms to generate and optimize modular robot designs efficiently.
http://arxiv.org/abs/2407.08623v1
Compressor summary: The paper analyzes the limitations of common metrics for high-dimensional comparisons and introduces a new dimension insensitive metric, DIEM, which overcomes these limitations for better interpretability and accuracy.
http://arxiv.org/abs/2407.08618v1
Compressor summary: The paper explores the text processing aspects of Language Computing, discussing advancements in deep learning, computational resources, linguistic annotation, and practical applications for languages like Tamil, emphasizing the need for more research collaboration and digitization.
http://arxiv.org/abs/2407.08608v1
Compressor summary: The paper proposes FlashAttention-3, which improves the speed and accuracy of attention in large language models on Hopper GPUs using three techniques: warp specialization, interleaved matmul and softmax operations, and block quantization with incoherent processing.
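The memory-efficient recurrence underlying FlashAttention-style kernels is the online softmax: attention for a query is accumulated one key/value at a time with a running max, so the full score matrix is never materialized. This plain-NumPy sketch shows the recurrence only, none of FlashAttention-3's Hopper-specific tiling or quantization:

```python
import math
import numpy as np

def online_softmax_attention(q, K, V):
    """Streaming attention for one query vector: maintain running max `m`,
    running normalizer `s`, and running weighted sum `acc`, rescaling the
    accumulators whenever the max increases."""
    m, s, acc = -math.inf, 0.0, np.zeros_like(V[0])
    for k, v in zip(K, V):
        x = float(q @ k)
        m_new = max(m, x)
        scale = math.exp(m - m_new) if m != -math.inf else 0.0
        s = s * scale + math.exp(x - m_new)
        acc = acc * scale + math.exp(x - m_new) * v
        m = m_new
    return acc / s
```

The result matches the usual softmax(qK^T)V computed all at once, but each key/value pair is touched exactly once.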
http://arxiv.org/abs/2407.08607v1
Compressor summary: The paper presents a new empathy detection method that uses six psychological indicators and improves empathy prediction with a large language model and fine-tuning, ranking 7th in a shared task.
http://arxiv.org/abs/2407.08590v1
Compressor summary: The text reviews nine popular simulation frameworks for reinforcement learning (RL) research, comparing them based on various criteria and highlighting their strengths and weaknesses.
http://arxiv.org/abs/2407.08583v1
Compressor summary:
Key points:
- Large language models (LLMs) and multi-modal language models (MLLMs) are powerful tools for various applications.
- Data is crucial for the development of LLMs and MLLMs, and data and models co-develop each other.
- The paper reviews existing works related to MLLMs from the data-model co-development perspective.
Summary: The paper explores how large and multi-modal language models rely on data and co-develop with it for various applications.
http://arxiv.org/abs/2407.08582v1
Compressor summary: The paper explores if a universal truthfulness hyperplane can be found within LLMs to distinguish factual correct and incorrect outputs across different tasks and domains using diverse datasets.
http://arxiv.org/abs/2407.08572v1
Compressor summary: The paper proposes a new post-train Dual Bayesian strategy to improve adversarial transferability in skeleton-based human activity recognition models by smoothing the loss landscape and crafting adversarial examples along motion dynamics.
http://arxiv.org/abs/2407.08571v1
Compressor summary: The text proposes Multi-Group Proportional Representation (MPR), a new metric to measure and improve representation of intersectional groups in image search and retrieval tasks, addressing the limitations of existing methods that only consider single or binary attributes.
http://arxiv.org/abs/2407.08569v1
Compressor summary: LiSe is a new method that uses LiDAR data and 2D images for unsupervised 3D object detection, improving performance by adaptively sampling and aggregating weak models.
http://arxiv.org/abs/2407.08567v1
Compressor summary: The paper introduces APA, a versatile activation function that adapts to the data distribution and improves performance in both balanced and imbalanced tasks.
http://arxiv.org/abs/2407.08564v1
Compressor summary: This study explores how large language models' career interests and competencies vary with language changes and model advancements, revealing their human-like tendencies and potential implications for integrating them into professional environments.
http://arxiv.org/abs/2407.08563v1
Compressor summary: This study finds that GPT-3.5, a large language model, does not accurately predict vote choice in Germany based on synthetic samples generated from survey data, highlighting the limitations of using LLMs for public opinion estimation.
http://arxiv.org/abs/2407.08561v1
Compressor summary: Our neural re-localization method using transformers improves autonomous driving by registering navigation maps to visual features, providing accurate and fast localization without HD maps.
http://arxiv.org/abs/2407.08558v1
Compressor summary:
Key points:
- TFE is important for intelligent traffic systems but traditional methods are costly and limited in coverage
- Cloud computing and vehicular network data offer a promising alternative
- ST-Mamba is a deep learning model that combines CNN with Mamba framework to enhance TFE accuracy and stability
- ST-Mamba uses minimal data and achieves precise and stable TFE results
Summary: ST-Mamba, a deep learning model that integrates CNN and Mamba framework, improves traffic flow estimation (TFE) accuracy and stability using minimal vehicular network data in a cost-effective way.
http://arxiv.org/abs/2407.08554v1
Compressor summary: The text discusses the gap between AI and clinical practice in medicine, suggesting new evaluation methods that involve patients and clinicians to improve AI's impact on healthcare.
http://arxiv.org/abs/2407.08551v1
Compressor summary: MELLE is a new TTS method that generates mel-spectrograms directly from text using continuous tokens and regression loss, improving fidelity and diversity over discrete codec models.
http://arxiv.org/abs/2407.08550v1
Compressor summary: The paper presents a method to integrate large language models into automated production systems, enhancing task automation and flexibility by using digital twins and microservices.
http://arxiv.org/abs/2407.08546v1
Compressor summary: The paper proposes a new evaluation metric, VCS, that correlates saliency maps of deep learning classifiers for Alzheimer's disease with brain volume changes across different regions, improving the understanding of the model's decision process.
http://arxiv.org/abs/2407.08536v1
Compressor summary: The paper introduces Learnable Drift Compensation (LDC), a method to mitigate semantic drift in prototype-based continual learning and achieve state-of-the-art performance in supervised and semi-supervised settings.
http://arxiv.org/abs/2407.08526v1
Compressor summary: The paper proposes BLOS-BEV, a model that combines onboard camera visual information and SD maps to extend the perception range of autonomous vehicles to 200 meters for better scene understanding and planning.
http://arxiv.org/abs/2407.08521v1
Compressor summary:
Key points:
- VLMs like CLIP are powerful but don't model hierarchical text structure for images
- Existing methods need costly training and don't use foundation models
- Foundation models have emergent understanding of visual-semantic hierarchies
- Radial Embedding framework probes and optimizes hierarchical understanding
- HierarCaps dataset is a benchmark for studying image-text hierarchies
- Foundation models outperform prior models in zero-shot hierarchical understanding
- Text-only fine-tuning improves alignment to hierarchical reasoning
Summary: The authors propose a framework and a dataset to study and improve hierarchical understanding of images and texts in foundation models like CLIP, which already have emergent knowledge of visual-semantic hierarchies.
http://arxiv.org/abs/2407.08517v1
Compressor summary: The paper proposes a generalized low-rank matrix completion model using overlapping group error representation to better capture global and local structure information of real data.
http://arxiv.org/abs/2407.08516v1
Compressor summary: The article discusses how connectionist and symbolic AI are converging in large language models like ChatGPT, which enable autonomous agents with enhanced reasoning and decision-making capabilities compared to knowledge graphs.
http://arxiv.org/abs/2407.08515v1
Compressor summary: The paper introduces FaceCaption-15M, a large dataset of facial images with natural language descriptions, to facilitate research on face-centered tasks and achieve state-of-the-art results on two challenging tasks using FLIP-based models.
http://arxiv.org/abs/2407.08513v1
Compressor summary: The paper discusses different methods to fine-tune Stable Diffusion XL for generating high-quality 2D icons and emphasizes the importance of defining "high-quality" in a commercial setting, as well as the limitations of FID and CLIP scores for evaluating icon quality.
http://arxiv.org/abs/2407.08507v1
Compressor summary: The paper proposes a novel self-supervised framework using vision-language models to estimate remote physiological signals from facial videos, improving performance over existing methods.
http://arxiv.org/abs/2407.08500v1
Compressor summary: Conda is a new method that improves dynamic graph learning by generating better embeddings for target nodes using latent diffusion and Variational Auto-Encoder techniques.
http://arxiv.org/abs/2407.08498v1
Compressor summary: The paper proposes a new Retinex model for image denoising that uses non-convex regularization and weak space oscillation, and demonstrates its effectiveness in removing noise from images.
http://arxiv.org/abs/2407.08497v1
Compressor summary: The paper introduces a method to explain and change the strength of arguments in Quantitative Bipolar Argumentation Frameworks using counterfactual explanations.
http://arxiv.org/abs/2407.08495v1
Compressor summary: The authors investigate if large language models can be used as voting advice applications in the European Parliament elections and explore ways to improve their performance, finding that MIXTRAL is highly accurate and that expert-curated information boosts accuracy by 9%.
http://arxiv.org/abs/2407.08489v1
Compressor summary: The paper introduces a new way to detect objects with orientation, using points and axes, that improves performance and avoids some common problems.
http://arxiv.org/abs/2407.08488v1
Compressor summary: LYNX is a state-of-the-art hallucination detection model that can reason well in real-world scenarios and outperforms other models on the new HaluBench benchmark.
http://arxiv.org/abs/2407.08484v1
Compressor summary:
Key points:
- The paper proposes a deep learning method to position skeleton joints in 3D human body models
- It uses synthetic samples and input points with normal vectors
- It outperforms the state-of-the-art with simpler architecture and faster processing times
Summary: The paper presents a simple and fast deep learning approach to locate skeleton joints in 3D human models using synthetic data and point features.
http://arxiv.org/abs/2407.08479v1
Compressor summary: RobustGANTT is a GNN-based scheduler that efficiently and adaptively computes carrier schedules for battery-free sensor tags in large-scale IoT networks.
http://arxiv.org/abs/2407.08476v1
Compressor summary: VideoMamba is an efficient and effective video recognition model that uses Mamba's linear complexity and selective SSM mechanism to capture spatial and temporal information in videos.
http://arxiv.org/abs/2407.08475v1
Compressor summary: This paper reviews public fine-tuning datasets for large models, focusing on their construction techniques and methods to provide a comprehensive overview and guide future research.
http://arxiv.org/abs/2407.08470v1
Compressor summary: The paper proposes a 3D-UNet model with a Context Transformer to segment brain tumors in MRI scans, achieving high accuracy compared to existing methods.
http://arxiv.org/abs/2407.08464v1
Compressor summary: The text introduces a new unsupervised GCRL method called TLDR, which uses temporal distance to guide exploration and goal-reaching in complex robotic environments.
http://arxiv.org/abs/2407.08460v1
Compressor summary: The paper reviews 27 state-of-the-art semi-supervised object detection methods that use a mix of labeled and unlabeled data to improve performance and reduce dependence on expensive labeled datasets.
http://arxiv.org/abs/2407.08458v1
Compressor summary: The paper proposes a DRL-based optimization method for reducing energy consumption and AoI in NR-V2X communication by employing interference cancellation with NOMA technology.
http://arxiv.org/abs/2407.08457v1
Compressor summary: The Neural Poisson Solver is a framework for blending visual signals represented by Implicit Neural Representations (INRs), using a continuous variational problem-solving approach to achieve natural, distortion-free results.
http://arxiv.org/abs/2407.08454v1
Compressor summary: The paper proposes KVMerger, a novel technique to compress KV cache for large language models in long-context scenarios without losing information or degrading performance under constrained memory budgets.
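The general KV-cache-merging idea can be illustrated with a greedy pass that fuses consecutive cache entries whose keys are nearly parallel. This is a generic sketch of the merging concept, not KVMerger's actual algorithm; the similarity threshold and pairwise averaging are illustrative choices:

```python
import numpy as np

def merge_kv(keys, values, sim_threshold=0.95):
    """Greedy KV-cache compression sketch: merge each entry into the previous
    slot when their keys' cosine similarity exceeds `sim_threshold`, using a
    simple pairwise average (a real implementation would weight by how many
    entries a slot already holds)."""
    merged_k, merged_v = [keys[0]], [values[0]]
    for k, v in zip(keys[1:], values[1:]):
        last = merged_k[-1]
        cos = k @ last / (np.linalg.norm(k) * np.linalg.norm(last) + 1e-8)
        if cos >= sim_threshold:
            merged_k[-1] = (merged_k[-1] + k) / 2  # fuse into previous slot
            merged_v[-1] = (merged_v[-1] + v) / 2
        else:
            merged_k.append(k)
            merged_v.append(v)
    return np.stack(merged_k), np.stack(merged_v)
```

Long contexts with many near-duplicate keys compress well under such schemes, which is why they target long-context inference under tight memory budgets.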
http://arxiv.org/abs/2407.08448v1
Compressor summary: ALISE is a novel method that produces aligned latent representations for satellite remote sensing imagery using spatial, spectral, and temporal dimensions, improving performance in crop segmentation and change detection tasks.
http://arxiv.org/abs/2407.08447v1
Compressor summary: WildGaussians is a novel method that combines robust features, appearance modeling, and 3D Gaussian Splatting to handle occlusions and appearance changes in 3D scene reconstruction, achieving state-of-the-art results with real-time rendering speed.
http://arxiv.org/abs/2407.08443v1
Compressor summary: The paper introduces "Infinite Motion", a novel approach for creating long and high-quality motion sequences from text, with applications in editing and splicing.
http://arxiv.org/abs/2407.08442v1
Compressor summary: The paper presents a new framework for classifying time-series imputation methods using deep learning, especially for clinical data, and discusses how to choose the best approach depending on the data properties and missingness scenarios.
http://arxiv.org/abs/2407.08441v1
Compressor summary: This study examines the biases in large language models and explores how prompt engineering can reveal them, highlighting the need for better mitigation techniques for a fairer AI.
http://arxiv.org/abs/2407.08440v1
Compressor summary: The paper clarifies the concept of rule-following, creates RuleBench benchmark to evaluate LLMs' rule-following abilities, and finds that current LLMs are limited in following rules.
http://arxiv.org/abs/2407.08434v1
Compressor summary: Using synthetic load profiles and transfer learning techniques improves the forecast accuracy of power consumption in the absence of historical data.
http://arxiv.org/abs/2407.08432v1
Compressor summary: The text describes a new algorithm (SG-RCPS) that improves uncertainty quantification in radiotherapy planning using magnetic resonance-guided linear accelerators, by providing prediction intervals for multiple subgroups with unknown membership at test time.
http://arxiv.org/abs/2407.08428v1
Compressor summary: The text surveys recent progress in human video generation using generative models, discussing methods for text-driven, audio-driven, and pose-driven motion generation, and reviewing datasets, evaluation metrics, and future research directions.
http://arxiv.org/abs/2407.08418v1
Compressor summary: PredBench is a benchmark for evaluating spatio-temporal prediction networks with diverse datasets and methods, offering comprehensive insights into their performance.
http://arxiv.org/abs/2407.08417v1
Compressor summary: The paper applies BERTopic topic modeling, which requires careful hyperparameter tuning, to Covid-19 fake news, revealing thematic similarities between countries.
http://arxiv.org/abs/2407.08415v1
Compressor summary: The variational SSM combines variational autoencoders and state space models to enable parallel training and generation for autoregressive sequence modeling.
http://arxiv.org/abs/2407.08414v1
Compressor summary: The paper proposes a new method to create realistic human avatars from multi-view videos using physics-based rendering and neural networks, overcoming the limitations of existing methods based on neural radiance fields.
http://arxiv.org/abs/2407.08411v1
Compressor summary: CLEO is a new continual learning setting for intelligent systems to adapt to evolving ontologies, while existing methods struggle with it.
http://arxiv.org/abs/2407.08410v1
Compressor summary: This paper shows how a specially trained AI model can outperform existing models and junior ophthalmologists in writing reports about age-related macular degeneration, demonstrating the potential of AI for medical diagnosis and care.
http://arxiv.org/abs/2407.08400v1
Compressor summary: The authors explore how language models can improve arithmetic reasoning without new data by using automated feedback from their predictions (self-training), achieving better results than supervised methods in online self-training.
http://arxiv.org/abs/2407.08395v1
Compressor summary: The paper investigates using neural networks and a modified metric to automatically detect paddle strokes in canoe sprint training sessions.
http://arxiv.org/abs/2407.08394v1
Compressor summary: Diff-Tracker uses a pre-trained text-to-image diffusion model to learn a prompt for unsupervised visual tracking and updates it online as the target moves.
http://arxiv.org/abs/2407.08388v1
Compressor summary: The paper discusses the theoretical basis of assigning confidence levels to Large Language Models (LLMs) and argues that their existence as belief systems is plausible but uncertain, while their attribution may be inaccurate due to experimental limitations.
http://arxiv.org/abs/2407.08380v1
Compressor summary: The authors propose using digital twins with CARLA simulator to generate a large dataset for accurate vision-based speed estimation, and achieve a low mean absolute error of under 3 km/h.
http://arxiv.org/abs/2407.08377v1
Compressor summary: The paper introduces a new dataset and method for handling atmospheric turbulence in long-range imaging, using dynamic and static priors to mitigate severe distortions effectively.
http://arxiv.org/abs/2407.08374v1
Compressor summary: The paper introduces OrthCR, a method to finetune vision-language models like CLIP for specific tasks by injecting orthogonal matrices into the transformer architecture and using cross-regularization to enhance robustness, stability, and generalization.
http://arxiv.org/abs/2407.08364v1
Compressor summary: SFTD is a new tool for comparing multi-scale topology of scalar functions defined on graphs or Euclidean spaces, improving applications in 3D computer vision such as cellular shape reconstruction and error detection.
http://arxiv.org/abs/2407.08362v1
Compressor summary: The paper proposes a new spiking neural network method for chronic lower back pain classification using an adaptive encoder and an ensemble of recurrent neural networks, achieving superior performance over traditional methods.
http://arxiv.org/abs/2407.08356v1
Compressor summary: The paper reviews FPGA-based solutions for processing event camera data in various applications, highlighting their advantages in energy efficiency and real-time performance.
http://arxiv.org/abs/2407.08351v1
Compressor summary: The paper proposes AutoBencher, a tool that creates novel and difficult benchmarks for language models by searching for datasets satisfying salience, novelty, and difficulty criteria.
http://arxiv.org/abs/2407.08349v1
Compressor summary: The paper presents a GUI for precise spinal screw placement using vertebrae segmentation from X-Ray images, improving preoperative planning and intra-operative guidance.
http://arxiv.org/abs/2407.08348v1
Compressor summary: The paper explores how data scaling affects mathematical reasoning in large language models and introduces the Skywork-Math series, which outperforms GPT-4 on certain benchmarks.
http://arxiv.org/abs/2407.08341v1
Compressor summary:
Key points:
- Paper proposes a deep feature extractor for iris recognition at arbitrary resolutions
- Resolution degradation reduces recognition performance of deep learning models
- Method of resolution-adaptive feature extraction with automatically switching networks improves robustness and performance
- Framework includes resolution expert modules specialized for different degradations
- Lower-resolution experts are trained by knowledge-distillation from high-resolution expert
Summary: The paper presents a method that adapts to different image resolutions and degradations for iris recognition using specialized networks that switch according to the input condition.
http://arxiv.org/abs/2407.08340v1
Compressor summary: SLRL is a novel framework for Multi-View Clustering that leverages both complementary and structural information among samples to improve clustering outcomes.
http://arxiv.org/abs/2407.08333v1
Compressor summary:
Key points:
- SR-Mamba is a novel attention-free model for surgical phase recognition
- It uses a bidirectional Mamba decoder to model long-distance temporal relationships in overlong sequences
- It enables single-step neural network training, improving accuracy and simplifying the process
- It achieves state-of-the-art performance on two datasets
Summary: SR-Mamba is a new attention-free model that uses a bidirectional Mamba decoder to accurately recognize surgical phases in overlong sequences with single-step neural network training.
http://arxiv.org/abs/2407.08330v1
Compressor summary: The Hierarchical Document Transformer (HDT) is a new sparse Transformer architecture that uses document structure to improve efficiency and performance for tasks like science, law, or medicine.
http://arxiv.org/abs/2407.08328v1
Compressor summary: The study uses natural language processing to analyse maternity care incidents, revealing disparities among ethnic groups and stressing the importance of data analysis for improving quality and equity.
http://arxiv.org/abs/2407.08324v1
Compressor summary: The paper introduces and applies a new metric for measuring similarity between Markov decision processes, which can help improve transfer learning in reinforcement learning.
http://arxiv.org/abs/2407.08322v1
Compressor summary: The paper introduces I-SIRch:CS, a framework that uses AI to analyse safety incident reports in healthcare, identify patterns, and improve safety by learning from past incidents.
http://arxiv.org/abs/2407.08313v1
Compressor summary: The paper studies how different aspects of Geometric Graph Neural Networks affect the accuracy, efficiency, and symmetry preservation of 3D atomic systems for molecular modeling.
http://arxiv.org/abs/2407.08303v1
Compressor summary: Perceptual Fusion is a method to create a large image-text dataset by fusing diverse perception experts and an efficient multimodal language model, improving comprehensive visual perception for existing models.
http://arxiv.org/abs/2407.08302v1
Compressor summary: The paper presents two refined and new impact measures for argumentation, compares them with gradual semantics, and analyzes their performance.
http://arxiv.org/abs/2407.08298v1
Compressor summary: The authors propose an AI-based method to select and design vegetation indices for monitoring crop growth using multispectral satellite data and deep neural networks.
http://arxiv.org/abs/2407.08296v1
Compressor summary: Q-Galore is a novel method that combines quantization and low-rank projection to reduce memory usage in training large language models without compromising performance.
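Q-Galore's exact recipe (quantized, low-rank optimizer states) is specified in the paper; as a rough, hypothetical sketch of the low-rank gradient projection idea it builds on (function names below are mine, not the paper's):

```python
import numpy as np

def low_rank_project(grad, rank):
    """Project a gradient matrix onto its top-`rank` left singular
    directions, so optimizer state can live in a small subspace."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]          # projection basis (m x rank)
    compact = P.T @ grad     # low-rank gradient/state (rank x n)
    return P, compact

def project_back(P, compact):
    """Map a low-rank update back into the full parameter space."""
    return P @ compact

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))
P, compact = low_rank_project(G, rank=8)
G_approx = project_back(P, compact)
# The optimizer only stores the (64x8) basis and (8x32) state instead
# of full (64x32) moment matrices; Q-Galore additionally quantizes them.
```

The memory saving comes from keeping Adam-style moments only for `compact`, not for the full gradient.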
http://arxiv.org/abs/2407.08290v1
Compressor summary: The study proposes a novel approach using deep neural networks to fill gaps in urban 3D data caused by vehicle occlusions, generating realistic point cloud scenes.
http://arxiv.org/abs/2407.08289v1
Compressor summary: The paper proposes an attention learning-based method to predict heart failure using EHR data and different optimizers with varying learning rates, achieving better results than existing methods like LSTM.
http://arxiv.org/abs/2407.08280v1
Compressor summary: WayveScenes101 is a dataset with 101 real-world driving scenes that challenge scene reconstruction methods with varying environmental and traffic conditions, and includes camera poses and metadata for evaluation.
http://arxiv.org/abs/2407.08279v1
Compressor summary: Continual Visual Mapping (CVM) is an approach that trains a small visual model to map into a fixed large language model's space, learning robust and generalizable vision representations from non-i.i.d. data and outperforming other methods on five benchmarks, even on low-resource devices.
http://arxiv.org/abs/2407.08277v1
Compressor summary: The authors propose a new method for object segmentation using monocular images without manual annotations, leveraging the concept of Stixel-World to recognize multiple objects.
http://arxiv.org/abs/2407.08274v1
Compressor summary: The study develops and explains deep learning models for predicting crop yields using satellite images and other data, and shows how different temporal samplings and input modalities affect the models' performance and reliability.
http://arxiv.org/abs/2407.08273v1
Compressor summary: The authors propose RB-SQL, a new framework that uses retrieval to improve LLMs' text-to-SQL performance by providing relevant schema and examples for in-context learning.
http://arxiv.org/abs/2407.08272v1
Compressor summary: PowerYOLO is an energy-efficient object detection system that uses a novel sensor and mixed precision quantisation to achieve high accuracy while reducing memory and computational complexity.
http://arxiv.org/abs/2407.08271v1
Compressor summary: The article proposes conformal prediction methods for Gaussian process interpolation to improve calibration and uncertainty quantification without sacrificing accuracy.
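The article's contribution is conformal prediction tailored to Gaussian process interpolation; the generic split-conformal construction it starts from can be sketched as follows (a simplified illustration, not the article's method):

```python
import numpy as np

def split_conformal_interval(resid_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction: use held-out calibration residuals
    to build prediction intervals with ~(1 - alpha) marginal coverage."""
    n = len(resid_cal)
    # Finite-sample-corrected empirical quantile of absolute residuals.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(resid_cal), level)
    return y_pred_test - q, y_pred_test + q

# Toy usage: residuals from a calibration split, one test prediction.
lo, hi = split_conformal_interval(np.array([-1.0, 1.0, -2.0, 2.0]),
                                  np.array([10.0]), alpha=0.5)
```

The appeal for GP interpolation is that this calibration step is distribution-free, so it can correct miscalibrated GP posterior intervals without retraining.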
http://arxiv.org/abs/2407.08269v1
Compressor summary: The text evaluates the morphological analysis abilities of various language models, finding that GPT-4 is somewhat challenged while smaller models perform poorly.
http://arxiv.org/abs/2407.08268v1
Compressor summary: CLIPtrase is a new method that improves semantic segmentation using CLIP by enhancing patch feature correlations without additional training, achieving state-of-the-art results.
http://arxiv.org/abs/2407.08265v1
Compressor summary: The paper proposes NLMTrack, a novel natural language modeling-based approach for thermal infrared (TIR) tracking that enhances temporal and spatial information and outperforms previous methods on multiple benchmarks.
http://arxiv.org/abs/2407.08260v1
Compressor summary: SALSA is a new framework that uses radial window attention, self-attention, and a Mixer layer to perform efficient and accurate LiDAR place recognition for large-scale mapping and localization tasks.
http://arxiv.org/abs/2407.08257v1
Compressor summary: The paper proposes RveRNet, a novel architecture for classifying food images by segmenting the food region, using DeiTs for classification, and integrating both local and global contexts.
http://arxiv.org/abs/2407.08255v1
Compressor summary: GraphMamba is a new framework for hyperspectral image classification that combines efficient graph structure learning with linear spectral encoding to extract spatial-spectral features, outperforming existing methods on real datasets.
http://arxiv.org/abs/2407.08250v1
Compressor summary: The text introduces Gradient-Boosting RL (GBRL), a framework that extends the benefits of Gradient Boosting Trees (GBT) to reinforcement learning, improving interpretability and performance in domains with structured or categorical features.
http://arxiv.org/abs/2407.08248v1
Compressor summary: The authors propose a method to generate accurate text descriptions of comic stories using language models and computer vision techniques, which can then be converted into audiobooks and eBooks.
http://arxiv.org/abs/2407.08244v1
Compressor summary: The paper proposes a new regularization method for 3D shape matching using synchronous diffusion to achieve smooth correspondences and improve performance.
http://arxiv.org/abs/2407.08243v1
Compressor summary: The text describes a new face anti-spoofing technique that focuses on learning identity-invariant liveness representations and handling style shifts to achieve state-of-the-art performance across different datasets and scenarios.
http://arxiv.org/abs/2407.08233v1
Compressor summary: DP-SBCD is a new method for training private neural networks with provable guarantees, which handles non-convex problems and uses calibrated adaptive noise.
http://arxiv.org/abs/2407.08232v1
Compressor summary: SwishReLU is a new activation function that combines ReLU and Swish, improving performance over ReLU with lower computational cost than Swish, especially for image datasets.
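The paper defines its own combination of ReLU and Swish; as a purely hypothetical illustration of how such a blend might look (the `swishrelu` form below is an assumption for illustration, not the paper's formula):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    # Swish: x * sigmoid(x)
    return x * sigmoid(x)

def swishrelu(x):
    """Hypothetical blend: ReLU's cheap identity on the positive side,
    Swish's smooth nonzero tail on the negative side."""
    return np.where(x >= 0.0, x, swish(x))
```

A blend like this keeps positive activations as cheap as ReLU while avoiding fully dead negative units, which matches the cost/performance trade-off the summary describes.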
http://arxiv.org/abs/2407.08231v1
Compressor summary: Event cameras capture brightness changes with high accuracy, but converting their signals into videos is challenging; diffusion models improve the quality and realism of the reconstructed videos.
http://arxiv.org/abs/2407.08227v1
Compressor summary: The text introduces DALL-M, a novel technique that uses large language models to generate patient contextual synthetic data for chest X-ray images, improving AI medical diagnostics by incorporating clinical tabular data and enhancing the model performance.
http://arxiv.org/abs/2407.08223v1
Compressor summary: Speculative RAG combines a generalist and specialist LLM to quickly generate and verify diverse, accurate, and updated responses using multiple subsets of retrieved documents.
http://arxiv.org/abs/2407.08221v1
Compressor summary: GAURA is a generalizable neural rendering method that can synthesize high-quality 3D scenes from imperfect images under various degradations without test-time optimization.
http://arxiv.org/abs/2407.08219v1
Compressor summary: The paper explores how large language models can generate contextually-relevant navigational instructions for blind and low-vision individuals in various scenarios, using a new dataset of images and goals.
http://arxiv.org/abs/2407.08215v1
Compressor summary: The paper presents a novel algorithm that uses active reinforcement learning and contextual data to improve stress detection efficiency and accuracy using PPG data from smartwatches.
http://arxiv.org/abs/2407.08214v1
Compressor summary: The paper introduces a novel approach called Stable Parallel Continual Learning (SPCL) to enhance the training stability of PCL for both forward and backward propagation, using orthogonal techniques to manage gradients and optimize network parameters.
http://arxiv.org/abs/2407.08209v1
Compressor summary: This paper proposes a novel method to generate more informative and consistent synthetic data for curvilinear object segmentation, using textual features and a ControlNet with spatial adaptation.
http://arxiv.org/abs/2407.08206v1
Compressor summary: The authors describe their approaches and results for the CEFE task at CCL-2024, achieving first place by using back-translation, NSP-based strategy, and Symmetric Cross Entropy loss.
http://arxiv.org/abs/2407.08204v1
Compressor summary: The paper proposes a data-driven method that aligns homologous chromosomes and diagnoses structural abnormalities using their similarity, improving upon existing methods.
http://arxiv.org/abs/2407.08200v1
Compressor summary: Our system uses computer vision to analyze live soccer matches, generating highlights, graphics, and insights for viewers.
http://arxiv.org/abs/2407.08199v1
Compressor summary: SRPose is a sparse keypoint-based method for estimating relative camera or object pose transformation from two RGB images, achieving competitive performance and generalizability.
http://arxiv.org/abs/2407.08196v1
Compressor summary: The study introduces SoupLM, a cost-efficient multimodal LLM assembled from different LLM variants with diverse training recipes, tasks, and data modalities.
http://arxiv.org/abs/2407.08195v1
Compressor summary: The paper presents a text-to-game engine that uses foundation models to generate interactive RPGs from simple text inputs, demonstrating the potential of generative AI for transforming the game industry.
http://arxiv.org/abs/2407.08192v1
Compressor summary: ARCO is a co-optimizing compilation framework that uses MARL to improve the efficiency and performance of mapping machine learning models onto diverse hardware platforms.
http://arxiv.org/abs/2407.08189v1
Compressor summary: fairBERTs is a framework that uses a generative adversarial network to remove bias from pre-trained language models while preserving their utility for natural language processing tasks.
http://arxiv.org/abs/2407.08188v1
Compressor summary: The text discusses how the lack of clear definitions for terms like "diversity" in machine learning datasets affects their creation and suggests ways to improve this process using social science principles.
http://arxiv.org/abs/2407.08187v1
Compressor summary: The authors propose ScaleDepth, a novel monocular depth estimation method that decomposes metric depth into scene scale and relative depth, achieving state-of-the-art performance across various scenes.
http://arxiv.org/abs/2407.08182v1
Compressor summary: This study evaluates using users' self-expression and traits in multi-task learning frameworks to improve predictions of their behavior based on NLP, finding that it enhances understanding and prediction.
http://arxiv.org/abs/2407.08179v1
Compressor summary: The CoGS framework uses s(CASP) to generate realistic and causally consistent counterfactual explanations for rule-based machine learning models, helping users understand decisions and changes in input attributes that lead to desired outcomes.
http://arxiv.org/abs/2407.08169v1
Compressor summary: The authors propose a new algorithm using Natural Gradient Descent for efficiently deleting data from trained machine learning models while maintaining strong privacy guarantees and improving generalization performance.
http://arxiv.org/abs/2407.08166v1
Compressor summary: The text describes how artificial intelligence can be used with electroretinogram tests to study and classify neurodevelopmental disorders like autism spectrum disorder.
http://arxiv.org/abs/2407.08164v1
Compressor summary: The paper proposes HC-MARL, a framework that uses contrastive learning to promote global consensus among agents without direct communication, enabling cooperative behavior in MARL tasks with dynamic requirements.
http://arxiv.org/abs/2407.08162v1
Compressor summary: The proposed system improves visual place recognition for robot navigation by using a multi-layer perceptron integrity monitor that reduces errors and increases success rates in real-world experiments.
http://arxiv.org/abs/2407.08156v1
Compressor summary: The study introduces Image Address Localization (IAL) problem for social media and photojournalism, proposing AddressCLIP, an end-to-end framework using contrastive learning and manifold learning to predict readable addresses from images, and creating three datasets for the task.
http://arxiv.org/abs/2407.08153v1
Compressor summary: The paper proposes a framework for lifelong whole slide image retrieval that addresses catastrophic forgetting by updating models on growing databases and maintaining distance consistency.
http://arxiv.org/abs/2407.08151v1
Compressor summary: Our method uses BLIP to extract content from source images, matches it with category information, and integrates it with SAM or YOLO for consistent Copy-Paste data augmentation in computer vision tasks.
http://arxiv.org/abs/2407.08150v1
Compressor summary: The authors introduce a new dataset (SRI-ADV) with multi-modal data from different users watching advertisement videos, and propose a model (HMLLM) to analyze cognitive understanding of video content across demographics.
http://arxiv.org/abs/2407.08149v1
Compressor summary: The paper presents a new method to estimate the shape and subsurface scattering parameters of translucent objects using polarization cues and introduces a large-scale synthetic dataset for training.
http://arxiv.org/abs/2407.08148v1
Compressor summary: SCPNet is a novel unsupervised framework for homography estimation that combines intra-modal self-supervised learning, correlation-based estimation, and consistent feature map projection, achieving state-of-the-art performance on various datasets.
http://arxiv.org/abs/2407.08147v1
Compressor summary: The paper introduces a new dataset and models for studying and classifying reduplication and repetition in Hindi, Telugu, and Marathi speech using computational linguistics.
http://arxiv.org/abs/2407.08137v1
Compressor summary: The survey explores different deep learning methods for creating realistic 3D scenes and discusses their pros and cons.
http://arxiv.org/abs/2407.08136v1
Compressor summary: EchoMimic is a novel approach that generates portrait videos using audio input and facial landmarks, overcoming the limitations of previous methods and outperforming alternatives in various evaluations.
http://arxiv.org/abs/2407.08134v1
Compressor summary: The paper presents a new neural network called Square-Highway that improves surface reconstruction from point clouds in computer graphics and medical imaging, compared to existing methods.
http://arxiv.org/abs/2407.08133v1
Compressor summary: This paper presents NVI, a large-scale annotated dataset for nonverbal interaction detection, and proposes NVI-DEHR, a hypergraph model that captures high-order interactions for better interpretation of nonverbal signals.
http://arxiv.org/abs/2407.08132v1
Compressor summary: The DMM model uses disparity information, multi-scale target-aware attention, and target-prior aware auxiliary task to improve multispectral oriented object detection while reducing computational complexity.
http://arxiv.org/abs/2407.08127v1
Compressor summary: The paper proposes a novel black-box model inversion attack method that uses a Prediction Alignment Encoder to map the target model's output into StyleGAN latent space, improving attack accuracy and reducing query numbers significantly.
http://arxiv.org/abs/2407.08126v1
Compressor summary: The paper introduces LEAP, a decoding method that uses label texts to disentangle overlapping events in audio-visual parsing, along with a semantic similarity loss function to improve performance and interpretability.
http://arxiv.org/abs/2407.08125v1
Compressor summary: The paper presents a real-time summarization system for Twitter that uses Dirichlet score and classifies tweets based on interest profiles, achieving good performance with metrics like MAP, CG, and DCG.
http://arxiv.org/abs/2407.08114v1
Compressor summary: The paper introduces a deep learning model that improves dental image analysis by combining ResNet50 with SimAM attention module, achieving superior performance and accuracy compared to traditional architectures.
http://arxiv.org/abs/2407.08113v1
Compressor summary: The paper presents dataset distillation techniques to synthesize realistic images from large datasets, addressing the limitation of bilateral equivalence that affects object recognition.
http://arxiv.org/abs/2407.08112v1
Compressor summary: The text discusses the challenges of modelling long sequences with deep neural networks, evaluates recent advances in this area, and highlights the need for more research on the extrapolation capabilities of different inductive biases.
http://arxiv.org/abs/2407.08107v1
Compressor summary: The paper proposes using machine learning techniques to predict sepsis onset, and evaluates them using various metrics, finding the meta-ensemble model to be the most accurate.
http://arxiv.org/abs/2407.08103v1
Compressor summary: The authors propose a method using automata theory to improve the output of language models for regular languages, which have many practical applications, while being efficient and easy to implement.
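The authors' construction is automata-theoretic; the underlying idea, masking the vocabulary so every decoded prefix stays inside a regular language, can be sketched with a toy DFA (the `a+b+` language and helper names are illustrative, not from the paper):

```python
# Toy DFA for the regular language a+b+ (one or more 'a', then 'b').
DFA = {
    ("start", "a"): "as",
    ("as", "a"): "as",
    ("as", "b"): "bs",
    ("bs", "b"): "bs",
}
ACCEPTING = {"bs"}

def allowed_tokens(state, vocab):
    """Mask the vocabulary: keep only tokens with a valid transition,
    so the language model can never leave the language mid-decode."""
    return [t for t in vocab if (state, t) in DFA]

def accepts(text):
    """Check whether a finished output is in the language."""
    state = "start"
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING
```

At each decoding step the model's logits would be restricted to `allowed_tokens(state, vocab)`, guaranteeing well-formed output by construction.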
http://arxiv.org/abs/2407.08101v1
Compressor summary: The QEVD benchmark and dataset explore how vision-language models can handle open-ended, asynchronous fitness coaching interactions by recognizing actions, identifying mistakes, and providing feedback.
http://arxiv.org/abs/2407.08100v1
Compressor summary: The paper proves that popular adaptive optimization methods like Adam fail to converge to any (possibly random) limit point when learning rates are bounded away from zero, even though they work well in practice.
http://arxiv.org/abs/2407.08099v1
Compressor summary: The study shows that Burrows' Delta method can effectively attribute authorship in Chinese poetry by identifying similarities among texts from the same poet and distinguishing them from other poets.
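Burrows' Delta itself is a standard stylometric measure: z-score the relative frequencies of the most frequent words across the candidate corpus, then compare texts by mean absolute z-score difference. A minimal sketch (toy frequencies, not the study's data):

```python
import numpy as np

def burrows_delta(freqs, candidate_idx, test_vec):
    """Burrows' Delta: z-score per-word relative frequencies across
    the candidate corpus, then return the mean absolute z-score
    difference between the disputed text and one candidate."""
    mu = freqs.mean(axis=0)
    sigma = freqs.std(axis=0) + 1e-12   # guard against zero variance
    z_cand = (freqs[candidate_idx] - mu) / sigma
    z_test = (test_vec - mu) / sigma
    return float(np.abs(z_cand - z_test).mean())

# Rows: candidate poets; columns: relative frequencies of top words.
freqs = np.array([[0.10, 0.20, 0.30, 0.40],
                  [0.40, 0.30, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
disputed = np.array([0.10, 0.20, 0.30, 0.40])  # matches poet 0 exactly
```

The candidate with the smallest Delta is attributed authorship, which is how the study groups texts by poet.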