This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-11, generated by the compressor, my personal LLM-based project.
http://arxiv.org/abs/2407.07895v1
Compressor summary: The paper introduces LLaVA-NeXT-Interleave, a Large Multimodal Model that handles multiple scenarios (multi-image, multi-frame, multi-view, and multi-patch) using the interleaved data format and the M4-Instruct dataset.
http://arxiv.org/abs/2407.07890v1
Compressor summary: The text discusses how training large language models on task-relevant data during pretraining can influence their performance and emergent behavior evaluations, and proposes a method to adjust for this issue by fine-tuning models on the same data before evaluation.
http://arxiv.org/abs/2407.07880v1
Compressor summary: The study proposes Dr. DPO, a method that enhances the robustness of Direct Preference Optimization (DPO) by integrating pairwise and pointwise noise resilience using Distributionally Robust Optimization (DRO).
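For reference, a minimal PyTorch sketch of the vanilla DPO objective that Dr. DPO builds on; the function and variable names are illustrative, and the paper's DRO-based robustification is not reproduced here:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO loss over per-sequence log-probabilities.

    Dr. DPO, per the summary, wraps an objective like this one in a
    distributionally robust formulation to tolerate noisy preference
    pairs; that extension is omitted from this sketch.
    """
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the implicit reward margin of chosen over rejected.
    return -F.logsigmoid(chosen - rejected).mean()
```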
http://arxiv.org/abs/2407.07874v1
Compressor summary: Toto is a new foundation model for time series forecasting that excels at both observability and general-purpose forecasting tasks.
http://arxiv.org/abs/2407.07873v1
Compressor summary: The paper presents a new framework for sampling from probability densities using physics-informed neural networks that solve partial differential equations, improving efficiency and coverage compared to existing methods.
http://arxiv.org/abs/2407.07860v1
Compressor summary: 4DiM is a novel view synthesis model that uses diffusion on 3D, 4D, and video data, enabling better fidelity and pose control, as well as handling temporal dynamics in scenes.
http://arxiv.org/abs/2407.07858v1
Compressor summary: Key points:
- Enterprise chatbots use generative AI to boost employee productivity
- RAG, LLMs, and orchestration frameworks are essential for building these chatbots
- Creating effective chatbots is challenging and requires careful engineering of RAG pipelines
- The authors present a framework (FACTS) and provide empirical results on tradeoffs between large and small LLMs
Summary: The paper introduces FACTS, a framework for building secure enterprise chatbots using generative AI, and discusses the challenges and tradeoffs involved.
http://arxiv.org/abs/2407.07853v1
Compressor summary: Progressive Growing of Patch Size is a resource-efficient method for dense prediction tasks that gradually increases difficulty by growing the patch size during model training, improving convergence and reducing costs.
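A minimal sketch of the patch-growing idea, assuming a simple staged schedule; the stage lengths and patch sizes below are invented for illustration, not the paper's values:

```python
import torch

def patch_size_schedule(epoch, sizes=(32, 64, 96, 128), epochs_per_stage=10):
    # Grow the training patch size in fixed-length stages (values invented).
    stage = min(epoch // epochs_per_stage, len(sizes) - 1)
    return sizes[stage]

def random_patch(image, size):
    # Crop a random square patch from a (C, H, W) tensor.
    _, h, w = image.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return image[:, top:top + size, left:left + size]
```

Training on small patches first keeps early epochs cheap; later stages see full-size context, which is the curriculum effect the summary describes.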
http://arxiv.org/abs/2407.07852v1
Compressor summary: OpenDiLoCo is an open-source tool that enables efficient and scalable training of large language models across multiple locations with low communication and high compute utilization.
http://arxiv.org/abs/2407.07848v1
Compressor summary: Our study reveals how sparsity patterns in ReLU Transformers vary across layers and time, affecting feature learning and causing some dimensions to turn off during training.
http://arxiv.org/abs/2407.07844v1
Compressor summary: The paper proposes OV-DINO, a method for open-vocabulary detection that pre-trains on diverse datasets and uses language-aware selective fusion to improve performance.
http://arxiv.org/abs/2407.07842v1
Compressor summary: The paper proposes a ViT-based ReID framework that fuses models trained on different aspect ratios, improving performance on vehicle re-identification tasks with non-square inputs.
http://arxiv.org/abs/2407.07841v1
Compressor summary: The study benchmarks ten slide-level aggregation techniques for medical imaging and finds that domain-specific foundation models outperform generic ones, but no single model excels in all tasks.
http://arxiv.org/abs/2407.07840v1
Compressor summary: The proposed method, DeCC, measures the reliability of a VLM's answers by comparing its internal reasoning process with indirect answers obtained from sub-questions.
http://arxiv.org/abs/2407.07835v1
Compressor summary: The paper introduces RoBus, a large multimodal dataset for controllable generation of road networks and building layouts in 3D cities, incorporating urban characteristics and providing evaluation metrics.
http://arxiv.org/abs/2407.07829v1
Compressor summary: The paper proposes Gromov-Monge-Gap (GMG), a regularizer for unsupervised disentangled representation learning that leverages geometrical constraints to preserve the structure of distributions supported on different spaces, making the model decoder-free and more scalable.
http://arxiv.org/abs/2407.07827v1
Compressor summary: The paper investigates using convolutional neural networks to predict the stability number of random graphs from their graph images.
http://arxiv.org/abs/2407.07821v1
Compressor summary: The paper proposes a new method to measure the reliability of neural network predictions under data distribution shifts by clustering outputs and using distances between class centroids and incorrect predictions as a metric for confidence.
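A rough NumPy sketch of centroid-distance scoring of the kind the summary describes; the feature space, clustering procedure, and metric here are assumptions, not the paper's exact recipe:

```python
import numpy as np

def centroid_confidence(features, labels, query):
    """Score a query by distance to the nearest class centroid.

    features: (N, D) array of model outputs; labels: (N,) predicted classes.
    Smaller distance is read as higher confidence. The paper's actual
    clustering and distance metric may differ from this sketch.
    """
    centroids = {c: features[labels == c].mean(axis=0)
                 for c in np.unique(labels)}
    dists = {c: float(np.linalg.norm(query - mu))
             for c, mu in centroids.items()}
    nearest = min(dists, key=dists.get)
    return nearest, dists[nearest]
```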
http://arxiv.org/abs/2407.07818v1
Compressor summary: The Misclassification Likelihood Matrix (MLM) is a new tool that helps assess how reliable neural networks are when predictions change due to distribution shifts and suggests ways to improve them.
http://arxiv.org/abs/2407.07816v1
Compressor summary: The paper reviews the recent developments and challenges in deep stereo matching, a field that has seen significant advancements in the last five years thanks to new architectures and paradigms.
http://arxiv.org/abs/2407.07810v1
Compressor summary: The study analyzes how large language models work by tracing token trajectories through transformer blocks and finds that increased alignment between singular vectors of Residual Jacobians positively correlates with model performance.
http://arxiv.org/abs/2407.07805v1
Compressor summary: SUMix is a novel data augmentation approach for deep learning that learns the mixing ratio and uncertainty of mixed samples to improve generalization ability and performance.
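For context, classic mixup (which SUMix extends) looks like the sketch below; SUMix's learned mixing ratio and per-sample uncertainty estimate are not reproduced here:

```python
import torch

def mixup(x, y, alpha=0.2):
    # Classic mixup: sample a Beta-distributed ratio and blend random pairs.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    # The loss is then lam * loss(y) + (1 - lam) * loss(y[perm]).
    return x_mixed, y, y[perm], lam
```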
http://arxiv.org/abs/2407.07802v1
Compressor summary: ROSA improves parameter-efficient fine-tuning for large models in natural language processing tasks by adapting subspaces of arbitrary dimension with zero latency overhead.
http://arxiv.org/abs/2407.07799v1
Compressor summary: The text discusses LAB, a benchmark for evaluating attribution in long document tasks, and finds that citation (response generation and evidence extraction in one step) generally performs best.
http://arxiv.org/abs/2407.07796v1
Compressor summary: The text introduces a new benchmark for testing large language models on grid-based games, showing varying performance across different games and prompt types, and providing open-access data for analysis.
http://arxiv.org/abs/2407.07794v1
Compressor summary: The paper proposes a reinforcement learning method to sequentially collect measurements for solving under-determined inverse problems, aiming to recover the signal with fewer measurements and improved performance.
http://arxiv.org/abs/2407.07791v1
Compressor summary: The paper investigates security risks of large language models in multi-agent systems due to the spread of manipulated knowledge and proposes a two-stage attack method to exploit these vulnerabilities.
http://arxiv.org/abs/2407.07789v1
Compressor summary: RCM is a novel feature-matching method that addresses the scarcity of matchable points, matching conflicts, and reliance on keypoint repeatability by switching views, using a conflict-free coarse matching module, and integrating a semi-sparse paradigm with a coarse-to-fine architecture.
http://arxiv.org/abs/2407.07780v1
Compressor summary: The paper proposes MGCAMT, a framework that aligns confidence across different levels to improve cross domain object detection using pseudo labels and Mean Teacher approach.
http://arxiv.org/abs/2407.07778v1
Compressor summary: Key points:
- The paper explores how many and what kind of primitive actions (APIs) are needed for a versatile embodied agent, using wikiHow tutorials as a source of instructions.
- The paper proposes a framework that uses few-shot prompting to generate Pythonic programs as agent policies and bootstraps a universe of APIs by reusing and creating them.
- Applied to a small fraction of wikiHow tutorials, the framework induces an action space of 300+ APIs, most of which are not supported by existing embodied simulators.
Summary: The paper investigates how to define a large action space for embodied agents, using wikiHow tutorials and a few-shot prompting framework that generates Pythonic programs as policies and bootstraps APIs.
http://arxiv.org/abs/2407.07771v1
Compressor summary: The proposed framework combines multiple tasks to generate comprehensive prompt words that guide ChatGPT to create high-quality tweets, and uses ChatGPT to evaluate the generated content.
http://arxiv.org/abs/2407.07765v1
Compressor summary: This paper shows that differential privacy and online learning are related in general classification tasks, using Ramsey-type theorems for trees.
http://arxiv.org/abs/2407.07764v1
Compressor summary: The paper proposes PosFormer, a position forest transformer for handwritten mathematical expression recognition that uses a forest structure to model the position and hierarchy of symbols and an implicit attention correction module to improve performance on complex datasets.
http://arxiv.org/abs/2407.07763v1
Compressor summary: The paper proposes a framework that enables efficient medical image segmentation by delivering semantic and domain knowledge between labeled and unlabeled data, improving performance on various challenging scenarios.
http://arxiv.org/abs/2407.07760v1
Compressor summary: This paper presents a robust video object segmentation framework with spatial-semantic features and discriminative object queries that achieves state-of-the-art performance on multiple datasets.
http://arxiv.org/abs/2407.07740v1
Compressor summary: The Lane Safety Metric (LSM) is a new method to assess the safety of lane detection systems for autonomous vehicles by considering factors like object detection, road type, and vehicle speed.
http://arxiv.org/abs/2407.07737v1
Compressor summary: The paper compares two methods for training large language models with user-level differential privacy, showing that one method performs better in different scenarios depending on the number of examples per user and the desired privacy level.
http://arxiv.org/abs/2407.07735v1
Compressor summary: The paper introduces NeRFProtector, a tool that allows NeRF creators to embed binary messages in their 3D scene representations while maintaining performance quality.
http://arxiv.org/abs/2407.07726v1
Compressor summary: PaliGemma is an open vision-language model, built on the SigLIP vision encoder and the Gemma language model, that transfers effectively to a wide range of tasks.
http://arxiv.org/abs/2407.07712v1
Compressor summary: Deep-Graph-Sprints (DGS) is a fast and efficient deep learning architecture for representing interconnected, evolving systems on continuous-time dynamic graphs (CTDGs).
http://arxiv.org/abs/2407.07674v1
Compressor summary: The paper explores using active learning to train deep neural networks as surrogate models for scientific simulations, reducing the need for extensive and expensive simulation data.
http://arxiv.org/abs/2407.07673v1
Compressor summary: The paper proposes a new framework for Semi-Supervised Temporal Action Localization (SS-TAL) that improves pseudo-label selection by jointly learning classification confidence and localization reliability, eliminating ambiguous positives, and enhancing action discrimination.
http://arxiv.org/abs/2407.07671v1
Compressor summary: The text discusses the challenges and risks of AI making moral decisions, as ethics lacks a precise mathematical framework and human moral decision-making is imperfect.
http://arxiv.org/abs/2407.07668v1
Compressor summary: The text discusses how to reduce catastrophic forgetting in machine learning models using memory and predictive uncertainty measures.
http://arxiv.org/abs/2407.07667v1
Compressor summary: VEnhancer is a framework that improves low-quality generated videos by adding details and removing artifacts using a conditioned video diffusion model and a video ControlNet.
http://arxiv.org/abs/2407.07666v1
Compressor summary: S.C.O.R.E. is a 5-aspect framework to evaluate large language models in healthcare based on safety, consensus, objectivity, reproducibility, and explainability.
http://arxiv.org/abs/2407.07664v1
Compressor summary: HPL is a representation-learning approach that places class prototypes on the unit hypersphere, biasing representations toward scale invariance and a known geometry, with an improved optimisation procedure and prototype placement.
http://arxiv.org/abs/2407.07662v1
Compressor summary: The paper proposes a new method to remove hidden triggers from machine learning models, which can cause them to behave unexpectedly, by using unseen data samples to adjust the model's weights.
http://arxiv.org/abs/2407.07660v1
Compressor summary: The paper proposes a registration-guided consistency approach with disentanglement learning for medical image synthesis, which improves alignment and preserves anatomical structures.
http://arxiv.org/abs/2407.07655v1
Compressor summary: The authors propose a selective $G$-Bispectrum that reduces computational cost and improves accuracy and robustness in deep neural networks for achieving group-invariance.
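For background, in the abelian case the $G$-Bispectrum reduces to the classical translation-invariant bispectrum shown below; the paper's selective variant computes only a subset of the coefficients of the general-$G$ analogue (this is standard background, not the paper's construction):

```latex
B_f(\omega_1, \omega_2) \;=\; \hat{f}(\omega_1)\,\hat{f}(\omega_2)\,\overline{\hat{f}(\omega_1 + \omega_2)}
```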
http://arxiv.org/abs/2407.07639v1
Compressor summary: The paper evaluates two methods for providing explanations for similarity search in graph data using GNNs and shows that gradient-based explanations have desirable properties such as actionability, consistency, and sparsity.
http://arxiv.org/abs/2407.07638v1
Compressor summary: Key points:
- VLMs learn image-text representations and use prompt learning to adapt to downstream tasks
- Prompt learning needs true labels, but often only candidate labels are available due to privacy or sensitivity concerns
- The paper proposes a framework that disambiguates candidate labels and leverages the VLM's prior knowledge
- The framework improves the robustness of prompt learning with candidate labels
Summary: The paper introduces a framework for prompt learning with candidate labels for vision-language models, which aligns model output with a mixed class posterior and uses various training objectives to improve performance.
http://arxiv.org/abs/2407.07631v1
Compressor summary: The paper develops sample-efficient risk-sensitive offline reinforcement learning algorithms for linear Markov Decision Processes using entropic risk measure.
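For context, the entropic risk measure the summary refers to is the standard objective below, with risk parameter $\alpha$; as $\alpha \to 0$ it recovers the plain expectation (the paper's algorithms themselves are not sketched here):

```latex
\rho_\alpha(X) \;=\; \frac{1}{\alpha} \log \mathbb{E}\!\left[ e^{\alpha X} \right]
```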
http://arxiv.org/abs/2407.07630v1
Compressor summary: The article reviews challenges of using web-mined data for pre-training large language models and suggests ways to improve their accuracy, reliability, and ethical responsibility.
http://arxiv.org/abs/2407.07627v1
Compressor summary: The paper explores using image-to-image translation to make 3D-rendered facial images more realistic and improve face recognition systems' performance on real-world data.
http://arxiv.org/abs/2407.07617v1
Compressor summary: The article discusses an annotation system for humor in texts based on readers' self-paced reading behavior and a related psycho-linguistic experiment.
http://arxiv.org/abs/2407.07616v1
Compressor summary: Key points:
- The paper tackles change detection and semantic segmentation with satellite image time series (SITS-SCD)
- It proposes a new architecture that improves over the state of the art and leverages long-term temporal information
- It investigates the impact of spatial and temporal shifts on SITS datasets using DynamicEarthNet and MUDS
- It finds that spatial domain shift is the most complex setting and that temporal shift affects change detection more than semantic segmentation
Summary: The paper presents a new method for detecting changes and identifying objects in satellite images over time, and studies how different types of shifts in data affect its performance.
http://arxiv.org/abs/2407.07614v1
Compressor summary: MARS is a new framework for generating images from text that combines pre-trained language models with visual understanding, enabling bilingual and efficient image synthesis.
http://arxiv.org/abs/2407.07613v1
Compressor summary: The paper introduces a probabilistic learning rate scheduler (PLRS) that does not follow the typical monotonically decreasing rule and proves its convergence, while also showing competitive performance in experiments.
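A minimal sketch of the probabilistic-scheduler idea: sample each step's learning rate from a distribution around a base value rather than following a monotone schedule. The log-normal choice here is an assumption for illustration, not the paper's distribution:

```python
import math
import random

def sample_learning_rate(base_lr=1e-3, sigma=0.5):
    # Draw this step's learning rate from a log-normal around base_lr
    # (distribution choice is illustrative, not the paper's PLRS).
    return base_lr * math.exp(random.gauss(0.0, sigma))
```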
http://arxiv.org/abs/2407.07612v1
Compressor summary: The study shows that large transformer models can learn causal reasoning from passive data and generalize well to new scenarios by training on multiple demonstrations of causal axioms.
http://arxiv.org/abs/2407.07611v1
Compressor summary: The text proposes physics-informed geometric operators (GOs) for improving performance prediction, dimension reduction, and generative models using high-level intrinsic geometric information and physics in the feature vector.
http://arxiv.org/abs/2407.07606v1
Compressor summary: The paper reviews computational models of construction grammar learning, synthesizes existing methodologies and results, and identifies challenges and opportunities for future research.
http://arxiv.org/abs/2407.07605v1
Compressor summary: The researchers developed a smartphone app that uses computer vision to automatically recognize and distinguish wounds on elderly patients' skin.
http://arxiv.org/abs/2407.07604v1
Compressor summary: The H-FCBFormer model uses a Vision Transformer and Fully Convolutional Network ensemble to accurately detect occlusal contacts in dentistry, improving on other machine learning methods and outperforming human dentists.
http://arxiv.org/abs/2407.07603v1
Compressor summary: iiANET is a hybrid model that combines global self-attention, local convolutions, and channel attention to capture long-range dependencies in complex images, outperforming some state-of-the-art models.
http://arxiv.org/abs/2407.07596v1
Compressor summary: The authors propose a framework to design randomized allocation rules for social programs that balance allocating resources to high-need individuals with evaluating the program's effectiveness, and demonstrate its benefits using data from human services in Allegheny County, Pennsylvania.
http://arxiv.org/abs/2407.07587v1
Compressor summary: The paper presents Let Occ Flow, a self-supervised method for predicting 3D occupancy and flow using only camera inputs, which outperforms existing methods on nuScenes and KITTI datasets.
http://arxiv.org/abs/2407.07586v1
Compressor summary: The paper explores simple methods for source-free object detection adaptation and shows that adapting batch statistics and using a modified Mean Teacher with strong-weak augmentation can outperform previous approaches.
http://arxiv.org/abs/2407.07582v1
Compressor summary: TIP is a novel framework for learning multimodal representations robust to incomplete tabular data, using self-supervised learning strategies and a versatile encoder.
http://arxiv.org/abs/2407.07580v1
Compressor summary: InstructLayout is a new framework for creating 2D and 3D layouts from natural language instructions with better controllability and fidelity, using a semantic graph prior and a layout decoder, and it outperforms existing methods in various tasks.
http://arxiv.org/abs/2407.07577v1
Compressor summary: The paper introduces a new model (IDA-VLM) and benchmark (MM-ID) to improve the ability of large vision-language models to recognize and associate instance identities across different scenes, which is crucial for understanding complex visual content like movies.
http://arxiv.org/abs/2407.07575v1
Compressor summary: The paper proposes a method to optimize resource allocation for vehicular edge computing networks using multi-agent deep reinforcement learning, considering delays caused by digital twin maintenance and computational processing.
http://arxiv.org/abs/2407.07566v1
Compressor summary: HebDB is a large dataset for Hebrew speech processing with raw recordings and pre-processed versions, along with two baseline ASR systems that outperform current multi-lingual alternatives.
http://arxiv.org/abs/2407.07565v1
Compressor summary: The paper investigates how code generation test sets can contaminate large language models and identifies three sources of contamination using a new dataset of Python prompts and solutions.
http://arxiv.org/abs/2407.07564v1
Compressor summary: DiTAC is a new trainable activation function that enhances the expressiveness and performance of deep neural nets in various tasks.
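A generic illustration of what a trainable activation function can look like, assuming a simple learnable blend of fixed bases; DiTAC's actual parameterization is considerably more expressive and is not reproduced here:

```python
import torch
import torch.nn as nn

class TrainableActivation(nn.Module):
    # Learnable convex blend of fixed activation bases (illustration only).
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        w = torch.softmax(self.logits, dim=0)
        return w[0] * torch.relu(x) + w[1] * torch.tanh(x) + w[2] * x
```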