This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-18, generated by Compressor, my personal LLM-based project.
http://arxiv.org/abs/2407.12784v1
Compressor summary: AgentPoison is a novel backdoor attack that targets long-term memory or RAG knowledge bases of LLM agents, allowing them to be compromised without additional model training or fine-tuning while maintaining normal performance for benign instructions.
http://arxiv.org/abs/2407.12783v1
Compressor summary: SMooDi is a new model that can generate stylized motion for different content and styles, using text guidance and a lightweight adaptor.
http://arxiv.org/abs/2407.12782v1
Compressor summary: CAT is a novel approach that uses labeled source domain samples to improve feature generation for the target domain in adversarial training, addressing challenges related to robustness, generalization, and alignment.
http://arxiv.org/abs/2407.12781v1
Compressor summary: The text describes a new method that allows controlling camera movement in text-to-video synthesis using transformer-based models and spatiotemporal embeddings.
http://arxiv.org/abs/2407.12777v1
Compressor summary: This paper introduces a new method using Gaussian Splatting to accurately render 3D humans from sparse views, by learning generalizable human Gaussians and leveraging 2D UV space of a template.
http://arxiv.org/abs/2407.12773v1
Compressor summary: The study proposes an AI system to automatically detect mitotic figures in cancer images, improving accuracy and consistency in grading and treatment decisions.
http://arxiv.org/abs/2407.12772v1
Compressor summary: The paper introduces a benchmark framework, LMMS-EVAL, for evaluating large multi-modal models, and proposes two new tools, LMMS-EVAL LITE and Multimodal LIVEBENCH, to address the evaluation trilemma of cost, coverage, and contamination.
http://arxiv.org/abs/2407.12759v1
Compressor summary: This paper reviews methods to interpret random forest models and provides a taxonomy of techniques for choosing appropriate tools based on interpretability aspects.
http://arxiv.org/abs/2407.12758v1
Compressor summary: The paper proposes a new unsupervised learning method for cross-modality pedestrian image retrieval, using mutual information, three learning principles, and iterative training with optimal transport assignment and prototype-based contrastive learning.
http://arxiv.org/abs/2407.12753v1
Compressor summary: LookupViT is a novel vision transformer block that compresses information from high-resolution tokens to reduce inference cost while maintaining or improving accuracy on various tasks such as image and video classification, and captioning.
http://arxiv.org/abs/2407.12749v1
Compressor summary: HDLCopilot is a natural language-based system that helps hardware engineers find information in PDKs faster and more accurately, using an LLM to understand complex queries and provide relevant results.
http://arxiv.org/abs/2407.12739v1
Compressor summary: GroundUp is a tool that helps architects design 3D urban areas by converting their sketches into 3D models and allowing them to revise quickly.
http://arxiv.org/abs/2407.12736v1
Compressor summary: CHOSEN is a co-design framework that automates ViT deployment on FPGAs, improving performance by using multi-kernel design, approximate non-linear functions, efficient logic block usage, and a novel compiler algorithm.
http://arxiv.org/abs/2407.12735v1
Compressor summary: EchoSight is a multimodal framework that uses retrieval-augmented generation to help large language models answer visual questions requiring encyclopedic knowledge.
http://arxiv.org/abs/2407.12734v1
Compressor summary: The paper introduces a Minecraft builder task benchmark for evaluating large language models' spatial reasoning and vector math skills.
http://arxiv.org/abs/2407.12730v1
Compressor summary: Uni-Food is a large, unified food dataset for vision-language tasks that includes images, categories, ingredients, recipes, and nutritional information, while RoDE is a novel method to improve LMMs by allocating parameters based on task complexity.
http://arxiv.org/abs/2407.12727v1
Compressor summary: The paper introduces NL2Contact, a model that generates realistic 3D hand-object contacts from natural language descriptions, and ContactDescribe, a dataset for training the model.
http://arxiv.org/abs/2407.12725v1
Compressor summary: The text introduces SarcasmCue, a new framework to improve large language models' sarcasm detection by using different prompting strategies that combine sequential and non-sequential methods.
http://arxiv.org/abs/2407.12724v1
Compressor summary: This paper proposes a meta-learning method for semiconductor defect inspection that adapts to new defect types without forgetting previous knowledge or requiring large datasets.
http://arxiv.org/abs/2407.12718v1
Compressor summary: SlimFlow is a framework for developing small and efficient one-step diffusion models using rectified flow, addressing challenges like initialization mismatch and distillation issues with Annealing Reflow and Flow-Guided Distillation.
http://arxiv.org/abs/2407.12710v1
Compressor summary: The paper proposes a method for developing learn-to-defer systems that work with human experts, optimizing accuracy under various constraints using a generalization of the Neyman-Pearson lemma, and showing improved results on the COMPAS and ACSIncome datasets.
http://arxiv.org/abs/2407.12709v1
Compressor summary: The paper proposes a method called mixture of multimodal experts (MoME) to improve generalist large language models on vision-language tasks by modulating features and incorporating sparsely gated experts.
http://arxiv.org/abs/2407.12705v1
Compressor summary: The text introduces a new virtual dressing method (IMAGDressing-v1) that allows users to edit and control clothing images in various scenes, using a novel metric (CAMI) and a large dataset (IGPair).
http://arxiv.org/abs/2407.12703v1
Compressor summary: The paper proposes a new method for knowledge graph completion that leverages the structural properties of the graphs and improves performance over existing methods.
http://arxiv.org/abs/2407.12702v1
Compressor summary: TransCAD is a transformer-based model that predicts 3D CAD models from point clouds using a hierarchical learning strategy and a loop refiner, achieving state-of-the-art results with a new metric for CAD sequence evaluation.
http://arxiv.org/abs/2407.12697v1
Compressor summary: The paper proposes a new method, Diverse Ensemble Entropy Minimization (DEnEM), to improve real-time prostate cancer detection using micro-ultrasound and deep learning, addressing the challenge of data distribution shifts across clinical centers.
http://arxiv.org/abs/2407.12684v1
Compressor summary: The paper presents a text-to-4D generation method built on a text-to-video diffusion model with reference video supervision, combining customized SDS losses and a prior-switching strategy across a static 3D stage and a dynamic stage, and introducing a dynamic modeling representation for deformation and topology changes, achieving better realism, consistency, and quality than existing methods.
http://arxiv.org/abs/2407.12682v1
Compressor summary: The authors propose a new approach to accurately map in-situ data for LPBF defect detection using novel IR features and demonstrate its effectiveness through printing, monitoring, and characterizing various parts.
http://arxiv.org/abs/2407.12679v1
Compressor summary: Goldfish is a method for comprehending videos of any length using efficient retrieval and MiniGPT4-Video, achieving significant improvements in both long and short video understanding.
http://arxiv.org/abs/2407.12676v1
Compressor summary: The paper proposes a method to improve diffusion model-based inverse problem solving by using a pretrained consistency model and enforcing constraints during the sampling process, achieving high reconstruction quality with fewer inference steps.
http://arxiv.org/abs/2407.12669v1
Compressor summary: This paper explores how to use privacy-preserving techniques, such as synthetic data generation and differentially private learning, to improve deep learning models for breast cancer detection from mammography images while protecting patient data.
http://arxiv.org/abs/2407.12667v1
Compressor summary: The paper proposes a scene graph-based NeRF method that handles noisy camera poses using confidence estimation, an IoU loss, and a coarse-to-fine strategy, and introduces a new dataset with outlier poses on which it outperforms existing methods in robustness and quality.
http://arxiv.org/abs/2407.12665v1
Compressor summary: The paper proposes patch-level training for large language models, which reduces computational costs by compressing multiple tokens into a single patch and predicting the next patch instead of the next token.
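The patch-level training idea in the summary above can be illustrated with a minimal sketch (not the paper's implementation): group tokens into fixed-size patches, compress each patch into one vector (mean-pooling is assumed here for illustration), and form next-patch prediction pairs, so the model sees a sequence shortened by the patch-size factor.

```python
import numpy as np

def tokens_to_patches(token_ids, patch_size):
    """Group a token sequence into fixed-size patches (truncating any remainder)."""
    n = len(token_ids) // patch_size
    return [tuple(token_ids[i * patch_size:(i + 1) * patch_size]) for i in range(n)]

def patch_embedding(embeddings, patch):
    """Compress one patch of tokens into a single vector by mean-pooling embeddings."""
    return np.mean([embeddings[t] for t in patch], axis=0)

# Toy setup: vocabulary of 10 tokens with random 4-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))

token_ids = [1, 3, 5, 7, 2, 4, 6, 8]
patches = tokens_to_patches(token_ids, patch_size=4)          # 2 patches of 4 tokens
inputs = [patch_embedding(embeddings, p) for p in patches]

# Training pairs: predict patch t+1 from patch t; the sequence the model
# processes is patch_size times shorter than the raw token sequence.
pairs = list(zip(inputs[:-1], inputs[1:]))
print(len(token_ids), len(patches), len(pairs))               # 8 2 1
```

The compute saving comes from the shortened sequence; the paper's actual patch construction and prediction objective may differ from this mean-pooling sketch.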
http://arxiv.org/abs/2407.12661v1
Compressor summary: The paper improves 3D scene reconstruction from multi-view images by regularizing geometric modeling: it encourages mutual information among the surface normals of highly correlated scene points, identifies those points via semantic and geometric features, and thereby improves the surface reconstruction quality of SDF-based neural surfaces.
http://arxiv.org/abs/2407.12647v1
Compressor summary: The paper proposes a new method, OF-GPRN, that improves UAV detection in dual-vision images using optical fusion and a graph-pooling residual network, achieving a 17.9% higher mAP than ResGCN.
http://arxiv.org/abs/2407.12642v1
Compressor summary: The paper presents a text-guided image synthesis method that uses Large Language Models (LLMs) for global coherence and local context understanding, expanding images to arbitrary sizes conditioned on LLM-generated captions and visual features, and shows superior performance and zero-shot capability.
http://arxiv.org/abs/2407.12637v1
Compressor summary: The paper proposes an adaptive quantization method for low-bit fixed-point training that minimizes the quantization error for large gradients using an optimal interval and an update algorithm.
http://arxiv.org/abs/2407.12632v1
Compressor summary: CerberusDet is a YOLO-based multi-headed framework for efficient object detection across multiple tasks and datasets.
http://arxiv.org/abs/2407.12629v1
Compressor summary: The paper proves that AdaGrad and Adam, two adaptive gradient methods, have linear convergence when the cost function meets the Polyak-Łojasiewicz inequality, using a simple and unified approach for batch and stochastic gradients.
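For context, the Polyak-Łojasiewicz condition referenced above (with constant $\mu > 0$ and minimum value $f^*$) is:

\[
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x,
\]

which guarantees that every stationary point is a global minimum and is the standard route to linear convergence rates for gradient methods, even without convexity.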
http://arxiv.org/abs/2407.12626v1
Compressor summary: This study explores how domain-specific models and uncertainty estimation affect the entropy of a model's output probability distribution in biomedical applications.
http://arxiv.org/abs/2407.12622v1
Compressor summary: This paper improves generic event boundary detection (GEBD) models by simplifying their architecture, reducing redundancy, and enhancing spatiotemporal learning for faster and more accurate results in real-world applications.
http://arxiv.org/abs/2407.12620v1
Compressor summary: The text explores AI and NLP applications for preserving endangered Indigenous languages through community engagement, machine translation, and interactive language models.
http://arxiv.org/abs/2407.12616v1
Compressor summary: The text proposes a framework that efficiently adapts unimodal models to handle missing data and predict the missing modality's embedding using self-supervised learning.
http://arxiv.org/abs/2407.12614v1
Compressor summary: This study developed a fast and accurate machine vision system for monitoring strawberry growth and yield using pruned deep learning models and an enhanced object tracking algorithm.
http://arxiv.org/abs/2407.12599v1
Compressor summary: The paper presents a neural network architecture that leverages diversity principles to achieve high accuracy in self-supervised and semi-supervised learning tasks.
http://arxiv.org/abs/2407.12598v1
Compressor summary: The study proposes a method called algebraically observable PINNs to estimate epidemiological parameters using noisy and partial trajectory data.
http://arxiv.org/abs/2407.12597v1
Compressor summary: The study compares different YOLO models for automated wrist fracture detection and finds that they outperform the commonly used two-stage algorithm, Faster R-CNN, especially for pediatric patients.
http://arxiv.org/abs/2407.12594v1
Compressor summary: VisFocus is an OCR-free method for visual document understanding that uses attention mechanisms and pre-training to focus on relevant text patches based on the language prompt.
http://arxiv.org/abs/2407.12593v1
Compressor summary: The text describes using event cameras to improve sign language recognition and translation, introducing a new dataset (EvSign) and an efficient transformer-based framework for these tasks.
http://arxiv.org/abs/2407.12592v1
Compressor summary: VegeDiff is a diffusion model that probabilistically captures uncertainties in geospatial vegetation forecasting, separately modeling the impacts of dynamic meteorological and static environmental variables, and outperforms existing deterministic methods.
http://arxiv.org/abs/2407.12589v1
Compressor summary: Fed-Protoid is a novel method for privacy-preserving person re-identification that adapts models on edge devices using distributed source prototypes and minimizes a customized MMD loss.
http://arxiv.org/abs/2407.12582v1
Compressor summary: The paper proposes a novel method for object detection using event cameras and frame cameras, which improves performance and robustness under challenging conditions.
http://arxiv.org/abs/2407.12580v1
Compressor summary: E5-V is a framework that adapts large language models for creating universal multimodal embeddings, improving performance and reducing training costs compared to previous approaches.
http://arxiv.org/abs/2407.12579v1
Compressor summary: The paper presents RFBench, a benchmark for evaluating image generation from realistic-fantasy prompts, and RFNet, a training-free method combining diffusion models with LLMs to generate better images.
http://arxiv.org/abs/2407.12569v1
Compressor summary: Kolmogorov-Arnold Network can be trained privately and performs similarly to Multilayer Perceptron in differentially private settings.
http://arxiv.org/abs/2407.12568v1
Compressor summary: Reflecting learning is a novel learning paradigm that uses reviewing, summarizing, and correcting processes to improve long-tail recognition performance.
http://arxiv.org/abs/2407.12553v1
Compressor summary: The paper proposes a pipeline using reservoir computing and directed graph analysis for efficient brain representation in stroke data derived from MRI, enabling classification of effective connectivity and interpretation of disrupted networks with explainable AI tools.
http://arxiv.org/abs/2407.12550v1
Compressor summary: UniTE is a survey and a unified pipeline for pre-training trajectory embeddings that simplifies their development and analysis.
http://arxiv.org/abs/2407.12543v1
Compressor summary: Abstraction alignment is a method to measure how well an ML model's learned representations match human-expected abstractions using a human abstraction graph.
http://arxiv.org/abs/2407.12532v1
Compressor summary: The framework trains large language models as collaborative agents for coordinated behaviors in cooperative MARL by sharing private intentions, adapting comprehension strategies, and dynamically re-planning sub-tasks.
http://arxiv.org/abs/2407.12529v1
Compressor summary: Crafting the Path is a novel structured query rewriting method that improves information retrieval by generating relevant queries using a three-step process and is less dependent on Large Language Models' internal knowledge.
http://arxiv.org/abs/2407.12528v1
Compressor summary: The paper proposes a new algorithm for identifying causal parameters in linear structural models from observational data and proves that the identification task is computationally hard in general.
http://arxiv.org/abs/2407.12522v1
Compressor summary: Struct-X is a framework that helps large language models use structured data more effectively by encoding it into a topological space and guiding them through five phases to enhance reasoning abilities.
http://arxiv.org/abs/2407.12519v1
Compressor summary: CLTD is a new method for gait recognition that uses attention and projection to separate identity features from non-identity clues in spatial, temporal, and spectral domains.
http://arxiv.org/abs/2407.12517v1
Compressor summary: The paper evaluates how well deep learning models can learn from diverse climate data and transfer their knowledge across different tasks, locations, and variables.
http://arxiv.org/abs/2407.12514v1
Compressor summary: The paper explores why random initialization schemes outperform some pre-trained word and sub-word embeddings in transformer models, and suggests standardizing the embeddings' values as a solution.
http://arxiv.org/abs/2407.12512v1
Compressor summary: This paper introduces class-wise hardness and proposes GeoHard, a metric that measures the difficulty of different classes in natural language understanding tasks by analyzing their semantic embeddings.
http://arxiv.org/abs/2407.12511v1
Compressor summary: CoLIE is a novel approach that enhances low-light images by mapping coordinates to illumination components and reducing computational overhead, making it more adaptable and practical for various scenes and tasks.
http://arxiv.org/abs/2407.12508v1
Compressor summary: MERLIN is a system that uses large language models to improve text-video retrieval by aligning user queries with video content.
http://arxiv.org/abs/2407.12505v1
Compressor summary: This paper introduces Subequivariant Hierarchical Neural Networks (SHNN) for learning policies in complex 3D multi-entity systems, using local entity-level graphs and subequivariant message passing, and proposes a new benchmark (MEBEN) to evaluate them.
http://arxiv.org/abs/2407.12504v1
Compressor summary: The paper proposes a Case2Code task to evaluate and teach large language models (LLMs) inductive reasoning by synthesizing input-output transformations for executable programs and training LLMs on these synthetic cases.
http://arxiv.org/abs/2407.12501v1
Compressor summary: EmoFace is a novel audio-driven method for creating emotionally expressive 3D face animations with natural blinks, eye movements, and lip synchronization, and it comes with a new emotional dataset for MetaHuman models.
http://arxiv.org/abs/2407.12500v1
Compressor summary: The paper presents a case study of using automated systems to identify gender-biased language in US capital trials for women defendants, finding that they can help lawyers challenge their bias and refine annotation rules, but cannot replace human expertise.
http://arxiv.org/abs/2407.12498v1
Compressor summary: This study evaluates how well large language models perform on a benchmark with different learning methods and shows that they improve when using image captions or interleaved data, and few-shot learning.
http://arxiv.org/abs/2407.12492v1
Compressor summary: The paper proposes a probabilistic model that adapts a deployed model to distribution shifts by learning hidden feature dynamics and class prototypes without labels or model backbone access.
http://arxiv.org/abs/2407.12491v1
Compressor summary: The paper proposes a new hierarchical BEV perception paradigm for autonomous driving systems, using a library of modules and a user-friendly interface to solve challenges like lengthy development cycles and poor reusability.
http://arxiv.org/abs/2407.12483v1
Compressor summary: The paper introduces a semi-automated system, VARS, that improves football fairness and accuracy by analyzing multi-view videos of fouls and suggesting sanctions.
http://arxiv.org/abs/2407.12481v1
Compressor summary: The authors propose a data preparation approach for a multilingual Indic large language model: data is drawn from open-source and proprietary sources (Common Crawl, Indic books, news articles, and Wikipedia), cleaned with a custom per-language preprocessing pipeline to remove redundant and low-quality text, deduplicated on the Common Crawl portion, and tokenized with a novel multilingual tokenizer training strategy that outperforms the OpenAI Tiktoken tokenizer for Indic languages.
http://arxiv.org/abs/2407.12473v1
Compressor summary: The study uses refined BERT-based parsers to convert PDTB and RST annotations into dependencies, enabling unified analysis of different discourse corpora across languages.
http://arxiv.org/abs/2407.12470v1
Compressor summary: This paper proposes a continual learning framework for question answering with temporal memory and contrastive learning, and creates a new dataset to support the research.
http://arxiv.org/abs/2407.12453v1
Compressor summary: The authors propose using reinforcement learning to find the cheapest way for a system to transition between stable states in its energy landscape.
http://arxiv.org/abs/2407.12448v1
Compressor summary: EDIS improves offline-to-online RL by using a diffusion model to extract prior knowledge from offline data and energy functions to generate better online data.
http://arxiv.org/abs/2407.12445v1
Compressor summary: The paper presents a new framework for Sustainable Machine Learning that considers fairness, privacy, interpretability, and greenhouse gas emissions, and proposes a meta-learning algorithm to help users select optimal model architectures based on their requirements.
http://arxiv.org/abs/2407.12440v1
Compressor summary: GraphGuard is a new method that uses graphs and self-supervised learning to detect credit card fraud better than existing methods.
http://arxiv.org/abs/2407.12438v1
Compressor summary: The study explores how semantic-aware embeddings can improve information retrieval from large, diverse, and temporal data lakes in various application domains.
http://arxiv.org/abs/2407.12437v1
Compressor summary: VACERL is a framework that uses causal relationships to guide exploration in RL without assuming environmental causal variables, improving agent performance especially in challenging domains.
http://arxiv.org/abs/2407.12435v1
Compressor summary: The paper introduces Semantic-HOI, a new dataset for 3D human object interaction, and proposes F-HOI, a unified model that leverages multimodal instructions to handle diverse HOI tasks.
http://arxiv.org/abs/2407.12431v1
Compressor summary: The paper proposes GLARE, a new Low-Light Image Enhancement network that uses a codebook prior derived from normal-light images and a generative module to align low-light features with normal-light latent representations, resulting in improved performance on various benchmarks and real-world data.
http://arxiv.org/abs/2407.12427v1
Compressor summary: The paper presents GeneralAD, a framework that detects and generates semantic, near-distribution, and industrial anomalies using Vision Transformers and attention-based discriminators, achieving high performance across various datasets.
http://arxiv.org/abs/2407.12426v1
Compressor summary: The paper explores using fine-tuning techniques on RoBERTa to improve semantic textual relatedness across different languages, with promising results in Latin languages but challenges in Arabic.
http://arxiv.org/abs/2407.12425v1
Compressor summary: EACon is a framework that helps verify claims by abstracting evidence and deconstructing the claim into subclaims, improving the performance of large language models.
http://arxiv.org/abs/2407.12421v1
Compressor summary: SafePowerGraph is a safety-oriented framework and benchmark for evaluating Graph Neural Networks (GNNs) in power grid analysis; it integrates multiple simulators, assesses GNN performance under realistic scenarios with the safety and robustness requirements that existing benchmarks ignore, and finds that self-supervised learning and graph attention architectures are important for GNN robustness.
http://arxiv.org/abs/2407.12419v1
Compressor summary: DBGNNs are a new type of graph neural network that uses the topological Dirac equation to capture complex graph dynamics and outperforms conventional MPNNs for long-range predictions.
http://arxiv.org/abs/2407.12417v1
Compressor summary: The paper proposes a unimodal regularisation method that improves classification of extreme classes in ordinal problems using the generalised beta distribution and shows superior results compared to other methods.
http://arxiv.org/abs/2407.12415v1
Compressor summary: Frequency Dynamic Fusion (FreDF) is a novel time series forecasting method that captures long-term dependency by predicting and fusing different frequencies in the Fourier domain, adapting to various scenarios.
http://arxiv.org/abs/2407.12404v1
Compressor summary: The paper investigates the reliability and generalization properties of steering vectors for language models, finding that they have limitations in terms of variable effectiveness and brittleness.
http://arxiv.org/abs/2407.12402v1
Compressor summary: The paper introduces TurkishMMLU, a multitask, multiple-choice Turkish QA benchmark to evaluate LLMs' understanding of the Turkish language across various subjects in high-school curricula.
http://arxiv.org/abs/2407.12401v1
Compressor summary: The paper proposes GOAR, a geometric feature-perturbation approach for XAI that overcomes the limitations of pixel-perturbation methods like ROAR.
http://arxiv.org/abs/2407.12399v1
Compressor summary: The paper proposes an optimization method for simplifying scalar data that preserves important features and can handle different topological structures, making it practical for real-life datasets and improving analysis and visualization.
http://arxiv.org/abs/2407.12397v1
Compressor summary: This paper explores how post-training quantization affects recurrent language models, identifying activation outliers as a challenge similar to transformer-based models.
http://arxiv.org/abs/2407.12395v1
Compressor summary: EDUS is a new method for fast and efficient urban view synthesis from sparse input images using noisy predicted geometric priors.
http://arxiv.org/abs/2407.12393v1
Compressor summary: This study proposes PersLLM, a method to integrate psychology-grounded principles of personality into large language models, enhancing their utility in domains like social simulations and human-machine interactions.
http://arxiv.org/abs/2407.12389v1
Compressor summary: The text discusses how AI and ML advances enable researchers to compare language development across 27 languages using new transcription and analysis methods.
http://arxiv.org/abs/2407.12383v1
Compressor summary: RECE is a novel approach that efficiently modifies text-to-image models to erase inappropriate concepts without compromising their generation ability or requiring additional fine-tuning.
http://arxiv.org/abs/2407.12376v1
Compressor summary: The study develops an advanced deep learning model for sentiment analysis of Olympic-related tweets, achieving the highest accuracy with the BERT model.
http://arxiv.org/abs/2407.12375v1
Compressor summary: FETCH is a two-stage compression method that improves accuracy in class-incremental continual learning using compressed replay with GDumb.
http://arxiv.org/abs/2407.12371v1
Compressor summary: The HIMO dataset provides a large collection of full-body human interactions with multiple objects, along with textual descriptions and temporal segments, for training models on two novel tasks: HOI synthesis and fine-grained timeline control.
http://arxiv.org/abs/2407.12370v1
Compressor summary: This study analyzes the temporal receptive field in dynamic graph learning, showing its importance for accurate prediction in evolving networks.
http://arxiv.org/abs/2407.12366v1
Compressor summary: The authors propose a method to bridge the gap between large language models and vision-and-language navigation tasks by aligning visual content in a frozen language model, enabling better integration of language and navigation policy networks.
http://arxiv.org/abs/2407.12363v1
Compressor summary: GuideCQR is a framework that improves conversational search by using guided documents to refine queries, generate expected answers, and filter results, outperforming previous methods and LLM prompts.
http://arxiv.org/abs/2407.12358v1
Compressor summary: ProcTag is a data-oriented method that evaluates document instruction datasets by tagging the execution process of instructions, enabling selective sampling or filtering for training large language models on document visual question answering tasks.
http://arxiv.org/abs/2407.12357v1
Compressor summary: Graph-based explanations improve usability but not understanding for AI recommendations compared to textual explanations.
http://arxiv.org/abs/2407.12356v1
Compressor summary: The paper introduces a new layout similarity measure based on optimal transport, which can handle various layout differences and is applicable to many layout generation tasks.
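An optimal-transport view of layout similarity like the one summarized above can be sketched minimally: with equally many, uniformly weighted elements (an assumption of this sketch, not necessarily of the paper), optimal transport reduces to an optimal assignment between the elements of the two layouts.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def layout_distance(boxes_a, boxes_b):
    """Dissimilarity between two layouts, each an (N, 4) array of [x, y, w, h] boxes.
    With uniform weights and equal counts, optimal transport reduces to an optimal
    assignment: match every box in A to a box in B minimizing total pairwise cost."""
    a, b = np.asarray(boxes_a, float), np.asarray(boxes_b, float)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise L2
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

layout1 = [[0, 0, 2, 1], [3, 3, 1, 1]]
layout2 = [[3, 3, 1, 1], [0, 0, 2, 1]]           # same boxes, listed in another order
print(layout_distance(layout1, layout2))          # 0.0: invariant to element order
```

Because the matching is optimized rather than fixed by element order, the measure is insensitive to how the boxes are listed; the paper's measure is more general (e.g., handling layouts with different element counts and attributes).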
http://arxiv.org/abs/2407.12354v1
Compressor summary: The paper presents a new method for optimizing camera pose and NeRF using overparameterized representations, rigid warp functions, and invertible neural networks.
http://arxiv.org/abs/2407.12346v1
Compressor summary: The proposed object-aware query perturbation framework improves cross-modal image-text retrieval for small objects by generating a key feature subspace of the detected objects and perturbing the queries using this subspace.
http://arxiv.org/abs/2407.12345v1
Compressor summary: The paper presents a trajectory prediction method that combines visual inputs from surround-view cameras with textual descriptions generated by a VLM and refined by an LLM, achieves a 53 ms latency suitable for real-time processing, outperforms previous methods of similar performance, and introduces nuScenes-Text, a new dataset with rich textual annotations.
http://arxiv.org/abs/2407.12344v1
Compressor summary: The paper explores how personality traits affect LLM safety abilities and shows that modifying these traits can improve their performance in various aspects.
http://arxiv.org/abs/2407.12342v1
Compressor summary: WordFS is a new method to reduce word embedding dimensions while maintaining efficiency and effectiveness in various natural language processing tasks.
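WordFS itself selects embedding dimensions rather than projecting them, so the sketch below is only a common PCA baseline for the same goal, shrinking word vectors while preserving most variance; it is not the paper's method, and the matrix here is random placeholder data.

```python
import numpy as np

def reduce_dim_pca(emb, k):
    """Project embeddings onto their top-k principal components
    (a standard baseline for embedding compression, not WordFS)."""
    centered = emb - emb.mean(axis=0)
    # SVD of the centered matrix; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Hypothetical 1000-word vocabulary with 300-dimensional embeddings.
vocab_emb = np.random.default_rng(2).normal(size=(1000, 300))
small = reduce_dim_pca(vocab_emb, 50)
print(small.shape)  # (1000, 50)
```

A selection-based method like WordFS instead keeps a subset of the original 300 coordinates, which preserves interpretability of individual dimensions at the cost of a less optimal variance trade-off.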
http://arxiv.org/abs/2407.12336v1
Compressor summary: The text argues that today's globalized digital world needs a multilingual dataset for multi-document summarization (M2DS) and presents one built from BBC articles in five languages.
http://arxiv.org/abs/2407.12332v1
Compressor summary: The paper explains why some models generalize well on the modular addition problem even after overfitting, showing a transition from kernel-like to gradient-descent-like behavior.
http://arxiv.org/abs/2407.12331v1
Compressor summary: The paper proposes I2AM, a method to enhance interpretability of image generation models by aggregating patch-level cross-attention scores, enabling detailed attribution analysis and evaluation for reference-based image inpainting tasks.
http://arxiv.org/abs/2407.12330v1
Compressor summary: This paper proposes an energy model-based instance-wise calibration method for deep neural networks to improve their uncertainty estimation and reliability in multi-class classification tasks, especially for out-of-distribution inputs.
http://arxiv.org/abs/2407.12327v1
Compressor summary: The Spectra LLM suite explores low-bitwidth language models and their performance, training dynamics, and scaling trends, with promising results in size reduction and commonsense reasoning, but challenges in toxicity and perplexity.
http://arxiv.org/abs/2407.12317v1
Compressor summary: The paper proposes a novel method for out-of-length (OOL) text recognition, Sub-String Matching Text Recognition (SMTR), which uses cross-attention and regularization training to handle text of arbitrary length.
http://arxiv.org/abs/2407.12315v1
Compressor summary: ModalChorus is an interactive system for visual probing and alignment of multi-modal embeddings, using a two-stage process with Modal Fusion Map and embedding alignment to enhance modality fusion and intention articulation.
http://arxiv.org/abs/2407.12309v1
Compressor summary: MEDFuse is a framework that fuses structured and unstructured EHR data using multimodal embeddings to improve clinical decision-making, achieving high performance in multi-label classification tasks.
http://arxiv.org/abs/2407.12307v1
Compressor summary: The text introduces a weakly-supervised method for 3D hand reconstruction that uses hand knowledge from different sources and uncertainty modeling to train with 2D landmark annotations, improving performance over existing methods.
http://arxiv.org/abs/2407.12306v1
Compressor summary: Splatfacto-W is a novel view synthesis method that improves scene consistency in unconstrained images by incorporating per-Gaussian neural color features, per-image appearance embeddings, and a spherical harmonics-based background model.
http://arxiv.org/abs/2407.12295v1
Compressor summary: The paper proposes a codebook-based remote sensing image compression method that leverages VQGAN to generate a discrete codebook and uses a Transformer-based prediction model and a hierarchical prior integration network to enhance the decoding performance.
http://arxiv.org/abs/2407.12292v1
Compressor summary: The GAKer method generates adversarial examples that can fool deep neural networks into recognizing any image as a target object, improving attack success rates for both known and unknown classes.
http://arxiv.org/abs/2407.12291v1
Compressor summary: Joint Score Distillation (JSD) improves text-to-3D generation by considering coherence among views and using energy functions to capture view-aware information.
http://arxiv.org/abs/2407.12282v1
Compressor summary: The authors propose a diffusion model and a novel architecture for macro placement in digital circuits, which achieves competitive performance and trains at scale using synthetic datasets.
http://arxiv.org/abs/2407.12279v1
Compressor summary: ER-FSL is a novel online continual learning method that uses multiple feature subspaces to learn new data and replays old data in a larger feature space to prevent catastrophic forgetting.
http://arxiv.org/abs/2407.12277v1
Compressor summary: The paper proposes a multi-modal reranker for visual question answering that uses cross-item interaction to improve ranking quality and relevance score modeling of knowledge candidates.
http://arxiv.org/abs/2407.12275v1
Compressor summary: The text discusses how transformers can compose tasks from independent components, but struggle to generalize compositionally unless there's a bottleneck separating task inference and execution.
http://arxiv.org/abs/2407.12273v1
Compressor summary: GRIDS is a new image restoration method that groups similar degradations and improves efficiency and effectiveness of restoration.
http://arxiv.org/abs/2407.12271v1
Compressor summary: The paper presents a new method to accurately detect retinal branching angles using image processing and provides an open-source annotation tool and a benchmark dataset for evaluation.
http://arxiv.org/abs/2407.12269v1
Compressor summary: Unified Temporal Graph (UTG) is a framework that combines snapshot- and event-based machine learning models for temporal graphs, improving their performance and efficiency.
http://arxiv.org/abs/2407.12267v1
Compressor summary: The authors propose an autoregressive model with a unified wire-based representation, pairing a graph-based autoencoder with a transformer-based decoder that iteratively predicts geometric tokens, to generate semantically enriched 3D house wireframes that can be segmented into components, achieving superior accuracy and novelty.
http://arxiv.org/abs/2407.12259v1
Compressor summary: In-context probing is a useful method for valuing and selecting training data for large language models, as it approximates the influence functions that estimate the contribution of data to model predictions.
http://arxiv.org/abs/2407.12258v1
Compressor summary: The paper describes a method for facial expression analysis using Transformer Encoder and visual features, achieving better results than previous methods in a competition.
http://arxiv.org/abs/2407.12257v1
Compressor summary: The paper proposes an ensemble learning method using different neural networks to recognize complex human emotional expressions accurately.
http://arxiv.org/abs/2407.12255v1
Compressor summary: The paper proposes a novel end-to-end network (DHAN-SHR) that uses hybrid attention mechanisms to remove specular highlights from images without relying on additional priors or supervision, achieving state-of-the-art performance and introducing a large-scale benchmark dataset.
http://arxiv.org/abs/2407.12254v1
Compressor summary: COKE is a method that uses expert knowledge and chronological order to construct causal graphs for fault diagnosis and optimization in manufacturing processes without imputing missing data, achieving significant improvement in F1-score compared to benchmark methods.
http://arxiv.org/abs/2407.12247v1
Compressor summary: The paper proposes a bidirectional RNN model for predicting missing characters in ancient manuscripts, which can help scholars rank possible reconstructions but not reconstruct the text definitively.
http://arxiv.org/abs/2407.12243v1
Compressor summary: The thesis aims to improve interpretability of deep neural networks by introducing self-explanatory designs, studying neuron activation phenomena, and analyzing visual analytics applications.
http://arxiv.org/abs/2407.12240v1
Compressor summary: Our method adapts a pre-trained model to different unlabelled domains at test time by updating both the feature extractor and the classifier with meta-learning and cascading, and introduces new evaluation metrics for real-world scenarios.
http://arxiv.org/abs/2407.12239v1
Compressor summary: The text describes a new approach to solving computer vision problems with event cameras, which are harder to work with than standard cameras but perform better in certain scenarios.
http://arxiv.org/abs/2407.12238v1
Compressor summary: The study proposes a new framework using Graph Neural Networks and Adaptive Conformal Prediction to improve traffic flow prediction, outperforming existing models.
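Adaptive conformal prediction builds on the basic split-conformal recipe, which is easy to sketch: hold out calibration residuals and use their corrected quantile as an interval half-width with roughly (1 - alpha) marginal coverage. The sketch below uses made-up numbers and a plain split variant, not the paper's GNN pipeline or its adaptive update.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Split conformal prediction: the finite-sample-corrected quantile
    of held-out absolute residuals gives an interval half-width."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(residuals_cal, min(q_level, 1.0))
    return y_pred_new - q, y_pred_new + q

# Toy "traffic flow" setup with a noisy point predictor.
rng = np.random.default_rng(1)
y_cal = rng.normal(100, 10, size=500)          # calibration targets
pred_cal = y_cal + rng.normal(0, 5, size=500)  # predictions with noise
residuals = np.abs(y_cal - pred_cal)

lo, hi = split_conformal_interval(residuals, y_pred_new=95.0, alpha=0.1)
print(f"90% prediction interval: [{lo:.1f}, {hi:.1f}]")
```

The adaptive variants the summary refers to additionally re-tune the quantile level online as the data distribution drifts, which matters for non-stationary signals like traffic.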
http://arxiv.org/abs/2407.12234v1
Compressor summary: The text proposes a meta-learning framework for efficiently solving parabolic PDEs across different scenarios by learning from existing simulations.
http://arxiv.org/abs/2407.12223v1
Compressor summary: The paper proposes Conditional Quantile Estimation (CQE), a novel technique that uses quantile regression to capture the nuanced distribution of watch time in short video recommendation, improving accuracy and robustness.
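Quantile regression of the kind the summary describes minimizes the pinball loss, whose minimizer is the tau-th conditional quantile. The minimal sketch below uses synthetic watch-time data and a grid search instead of a learned model, so it only illustrates the loss, not the paper's CQE recommender.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss: asymmetric penalty minimized at the
    tau-th quantile of y_true."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Skewed synthetic "watch time" sample, in seconds.
rng = np.random.default_rng(0)
watch_time = rng.exponential(scale=60.0, size=10_000)

# Minimizing the loss over a grid recovers different quantiles per tau.
candidates = np.linspace(0, 400, 801)
for tau in (0.25, 0.5, 0.9):
    losses = [pinball_loss(watch_time, c, tau) for c in candidates]
    best = candidates[int(np.argmin(losses))]
    print(f"tau={tau}: estimated quantile ~ {best:.1f}s")
```

Fitting several tau values at once is what lets a quantile-based model capture the full skewed distribution of watch time rather than just its mean.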
http://arxiv.org/abs/2407.12220v1
Compressor summary: The text discusses the prevalence of questionable research practices in evaluating large language models and their negative impact on reproducibility.