This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-09, generated by the compressor, my personal LLM-based project.
http://arxiv.org/abs/2407.06192v1
Compressor summary: This work studies how large vision language models hallucinate multiple objects in images, introduces a new evaluation protocol called ROPE, and identifies factors affecting these behaviors.
http://arxiv.org/abs/2407.06191v1
Compressor summary: Tailor3D is a novel pipeline that creates customized 3D objects from dual-side images by emulating a tailor's ability to edit and stitch together the front and back of garments.
http://arxiv.org/abs/2407.06190v1
Compressor summary: SuperFlow is a novel framework that uses consecutive LiDAR-camera pairs to learn spatiotemporal features for accurate 3D perception in autonomous driving, reducing human annotations and improving performance across 11 heterogeneous datasets.
http://arxiv.org/abs/2407.06189v1
Compressor summary: Video-STaR is a self-training approach that leverages existing labeled video datasets for improving large vision language models' performance in various tasks.
http://arxiv.org/abs/2407.06188v1
Compressor summary: CrowdMoGen is a text-driven framework that uses a large language model to generate realistic and flexible crowd motions in various scenarios without paired training data.
http://arxiv.org/abs/2407.06187v1
Compressor summary: Jedi is a finetuning-free text-to-image model that uses reference images to generate personalized images with high quality and ease.
http://arxiv.org/abs/2407.06183v1
Compressor summary: The text discusses how curvature information affects learning rate tuning in deep learning, introduces a new method (CDAT) that stabilizes curvature better than classical methods, and shows CDAT's benefits in full batch and mini-batch regimes.
http://arxiv.org/abs/2407.06178v1
Compressor summary: The authors propose a method to identify snake species from images using Meta's DINOv2 vision transformer model and achieve promising results.
http://arxiv.org/abs/2407.06177v1
Compressor summary: The authors create a culture-centric benchmark to evaluate vision-language models' captioning abilities for visually impaired people in diverse settings, identifying challenges like hallucination and misalignment of automatic metrics with human judgment.
http://arxiv.org/abs/2407.06174v1
Compressor summary: This paper surveys deepfake generation and detection techniques, stressing the importance of effective countermeasures and data diversity to combat misinformation and fraud.
http://arxiv.org/abs/2407.05872v1
Compressor summary: The paper explores different parameterizations and optimizers for scaling models and proposes a new Adam variant called Adam-atan2 that avoids gradient underflow.
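To make the gradient-underflow point concrete: classic Adam divides the first moment by sqrt(second moment) + epsilon, which can underflow when both moments become tiny, whereas an atan2-based update is bounded and needs no epsilon at all. A minimal single-parameter sketch of this general idea (the function names and learning rate are illustrative, not taken from the paper):

```python
import math

def adam_step(m, v, lr=1e-3):
    # Classic Adam update direction: division by sqrt(v) + eps can
    # underflow or misbehave when m and v are both vanishingly small.
    eps = 1e-8
    return lr * m / (math.sqrt(v) + eps)

def adam_atan2_step(m, v, lr=1e-3):
    # atan2 is bounded and well-defined even at (0, 0), removing the
    # epsilon hyperparameter entirely.
    return lr * math.atan2(m, math.sqrt(v))

print(adam_atan2_step(0.0, 0.0))     # exactly 0.0, no division hazard
print(adam_atan2_step(1e-30, 1e-60))  # finite, sign-preserving update
```

The key design point is that atan2 saturates instead of diverging, so the update magnitude stays controlled without tuning an epsilon.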
http://arxiv.org/abs/2407.05869v1
Compressor summary: PORCA is a new RCA framework that can handle missing data and system heterogeneity, improving reliability in identifying root causes of faults.
http://arxiv.org/abs/2407.05868v1
Compressor summary: The authors propose an automated method to create false premise questions based on knowledge graphs, and introduce a large-scale benchmark for evaluating language models' vulnerability to factuality hallucination.
http://arxiv.org/abs/2407.05864v1
Compressor summary: The text discusses using neural networks to assign weights to states in an information set for better gameplay in Reconnaissance Blind Chess, and shows that a Siamese network outperforms a convolutional one.
http://arxiv.org/abs/2407.05862v1
Compressor summary: The paper proposes Point-CMAE, a method that combines masked autoencoder and contrastive learning for 3D point cloud pretraining with ViTs, improving representation quality and transfer performance.
http://arxiv.org/abs/2407.05858v1
Compressor summary: Mllm-npu is a novel system that efficiently leverages on-device neural processing units to speed up and save energy for large language models on mobile devices.
http://arxiv.org/abs/2407.05848v1
Compressor summary: The authors propose a Wavelet Transform Convolution (WTConv) layer for CNNs that enables larger receptive fields with fewer parameters, improved robustness, and better shape recognition.
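The receptive-field intuition behind wavelet-based convolution can be sketched in a few lines: applying a small kernel to a wavelet low-pass band makes it "see" a proportionally larger region of the original image at no extra parameter cost. This toy NumPy sketch (a single Haar level and a naive valid convolution, my own simplification rather than the paper's layer) illustrates the effect:

```python
import numpy as np

def haar_lowpass(x):
    # One level of a 2D Haar transform, keeping only the low-pass
    # (approximation) band: each output pixel summarizes a 2x2 input patch.
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    return (a + b + c + d) / 2.0

def conv3x3(x, k):
    # Naive valid 3x3 convolution, written as a loop for clarity.
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

img = np.arange(64.0).reshape(8, 8)
k = np.ones((3, 3)) / 9.0

# A 3x3 kernel on the low-pass band covers a 6x6 window of the input,
# doubling the effective receptive field with the same 9 parameters.
resp = conv3x3(haar_lowpass(img), k)
print(resp.shape)  # (2, 2)
```

Stacking further wavelet levels grows the effective receptive field geometrically while the kernel size stays fixed.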
http://arxiv.org/abs/2407.05844v1
Compressor summary: The paper proposes a generalist segmentation model that combines anatomy and pathology information for improved medical image analysis, using a novel query-based transformer approach.
http://arxiv.org/abs/2407.05843v1
Compressor summary: The study explores the effect of Neural Collapse on bias in medical imaging, finding that it can cause a significant drop in performance when trained with biased data.
http://arxiv.org/abs/2407.05842v1
Compressor summary: The authors propose a new method to generate realistic 3D blood vessel networks using denoising diffusion models, which can handle complex structures like capillaries and the Circle of Willis.
http://arxiv.org/abs/2407.05841v1
Compressor summary: The paper proposes Constrained Word2Vec (CW2V), a simple and effective method for initializing language models' tokenizers when expanding them to new languages, without requiring cross-lingual embeddings.
http://arxiv.org/abs/2407.05816v1
Compressor summary: Graph Reasoning Networks (GRNs) combine fixed and learned graph representations with a differentiable solver to improve reasoning in graph-based machine learning.
http://arxiv.org/abs/2407.05814v1
Compressor summary: The paper proposes a cross-domain few-shot in-context learning method using multimodal large language models to improve traffic sign recognition for autonomous driving.
http://arxiv.org/abs/2407.05811v1
Compressor summary: The text describes a multimodal approach to predict ego vehicle trajectories using ResNet-50, IMU sensor data, and high-definition maps, which improves accuracy and reliability in urban environments.
http://arxiv.org/abs/2407.05810v1
Compressor summary: The study examines how using AI chatbots in a medical imaging course affects students' engagement, perception, and learning outcomes, highlighting both benefits and concerns.
http://arxiv.org/abs/2407.05795v1
Compressor summary: The paper proposes HyCIR, which uses synthetic labels to improve ZS-CIR, and introduces SynCir, a pipeline that generates image-text pairs based on visual similarity and semantic similarity.
http://arxiv.org/abs/2407.05793v1
Compressor summary: The paper proposes an online optimization algorithm for dynamically pricing complementary items with sequential display, sales constraint, and uncertainty, using a Markov Decision Process framework and online learning tools.
http://arxiv.org/abs/2407.05789v1
Compressor summary: The paper introduces a new DAC benchmark (CANDID) and shows that sequential policies perform better than independent learning in managing interdependent action dimensions with varying importance.
http://arxiv.org/abs/2407.05788v1
Compressor summary: Constrained Bayesian Optimization (CBO) minimizes energy use while maintaining model performance in machine learning.
http://arxiv.org/abs/2407.05786v1
Compressor summary: This paper investigates how Large Language Models perform in identifying domain-specific entities within case law documents and finds that Mistral and Gemma are the most effective models for this task.
http://arxiv.org/abs/2407.05781v1
Compressor summary: The text studies the benefits and challenges of using shared features for multiple robotic agents in dynamic environments with changing goals, and shows that representation learning can reduce regret in control tasks compared to single-task learning.
http://arxiv.org/abs/2407.05778v1
Compressor summary: The paper challenges the self-consistency principle in large language models and suggests that consistent answers derived from longer reasoning texts are more likely to be correct due to autonomous chain-of-thought reasoning.
http://arxiv.org/abs/2407.05775v1
Compressor summary: The text presents a machine learning method for automated incident response agents that adapts to changing network structures by using relational agent learning with a message passing neural network to encode the network state as a graph, and shows advantages over a default vector representation in some cases on a cyber incident simulator.
http://arxiv.org/abs/2407.05771v1
Compressor summary: Ref-MC2 is a new inverse rendering method that uses multi-time Monte Carlo sampling to accurately reconstruct reflective 3D objects with environmental illumination and inter-reflections, while reducing computational complexity and improving geometry accuracy.
http://arxiv.org/abs/2407.05769v1
Compressor summary: The proposed 3D object detection framework uses a Semantic-aware Multi-branch Sampling module and multi-view consistency constraints to improve performance, especially for low-performance backbones.
http://arxiv.org/abs/2407.05765v1
Compressor summary: The paper proposes a method that combines invariant risk minimization (IRM) with Bayesian random semantic data augmentation to increase feature support overlap and improve deep models' out-of-distribution (OOD) generalization.
http://arxiv.org/abs/2407.05750v1
Compressor summary: The paper shows that large language models can process text layouts and answer questions requiring spatial reasoning, and this ability comes from pretraining data and instruction tuning.
http://arxiv.org/abs/2407.05740v1
Compressor summary: Multilingual large language models (LLMs) reduce bias and improve prediction accuracy compared to monolingual LLMs.
http://arxiv.org/abs/2407.05736v1
Compressor summary: The TransMA model predicts the efficiency of ionizable lipid nanoparticles (LNPs) for delivering mRNA using a multi-modal molecular structure fusion architecture that captures both 3D spatial and 1D sequential features, potentially speeding up the LNP design process.
http://arxiv.org/abs/2407.05734v1
Compressor summary: The text explores how well chatbots can understand predicate symmetry, a human ability, using large language models and in-context learning, and compares their performance to humans.
http://arxiv.org/abs/2407.05733v1
Compressor summary: The study suggests combining large language models and comparative judgment for automated essay scoring, outperforming traditional methods.
http://arxiv.org/abs/2407.05732v1
Compressor summary: The authors propose FairPFN, a transformer model that learns to eliminate the causal effects of protected attributes on observational data, improving counterfactual fairness in machine learning systems.
http://arxiv.org/abs/2407.05726v1
Compressor summary: The authors propose a video-based, non-invasive gait analysis method for scoliosis classification using a large-scale dataset, Scoliosis1K, and develop an enhanced model, ScoNet-MT, with promising diagnostic accuracy.
http://arxiv.org/abs/2407.05721v1
Compressor summary: PsycoLLM is a specialized psychological large language model trained on high-quality data and evaluated on psychological counseling exams, showing better performance than other LLMs.
http://arxiv.org/abs/2407.05718v1
Compressor summary: The paper introduces DoGe, a novel method for dialogue generation that alternates between internal and external knowledge to balance factuality and diversity without relying on randomness.
http://arxiv.org/abs/2407.05714v1
Compressor summary: The article describes how an industrial company used knowledge engineering methods to create a system, "SARBANES", that supports its business processes and preserves its expertise in a nuclear and defense context.
http://arxiv.org/abs/2407.05713v1
Compressor summary: The paper introduces SOIA-DOD, a method that detects active objects and predicts their interactions in egocentric videos, achieving state-of-the-art performance.
http://arxiv.org/abs/2407.05712v1
Compressor summary: MobilePortrait is a lightweight method for real-time neural head avatar animation on mobile devices, using a mixed representation of keypoints and precomputed visual features.
http://arxiv.org/abs/2407.05705v1
Compressor summary: The paper proposes a fast framework for continuous learning in knowledge graphs that preserves old knowledge and efficiently learns new knowledge using incremental low-rank adapters.
http://arxiv.org/abs/2407.05704v1
Compressor summary: The paper proposes APO-MVP, an algorithm for learning in adversarial MDPs with an oblivious adversary and a reward function revealed at the end of each episode, achieving a regret bound of $\mathcal{O}(\text{poly}(H)\sqrt{SAT})$ and avoiding occupancy measures.
http://arxiv.org/abs/2407.05703v1
Compressor summary: The paper introduces a new method (LGRNet) for segmenting uterine fibroids in ultrasound videos, which can help detect and treat them early to prevent malignant transformations.
http://arxiv.org/abs/2407.05700v1
Compressor summary: The paper proposes INVERSE-INSTRUCT, a method that improves instruction-tuned code LLMs by generating instructions from code snippets instead of translating natural language to code, resulting in better performance on various benchmarks.
http://arxiv.org/abs/2407.05694v1
Compressor summary: This essay explores the esoteric governance tool of compute thresholds, questions their effectiveness in mitigating risk, and suggests alternative approaches.
http://arxiv.org/abs/2407.05693v1
Compressor summary: Sub-SA is a method that uses submodular selection with reward-penalty regularization to reduce annotation costs while improving the quality of in-context learning examples.
http://arxiv.org/abs/2407.05690v1
Compressor summary: TransAct is a structured pruning method that reduces the computational overhead of large language models by pruning intra-module activations while preserving inter-module ones, achieving high compression with little loss in efficiency and performance.
http://arxiv.org/abs/2407.05688v1
Compressor summary: The paper proposes a novel method called LA-CMFER to improve cross-multidomain facial expression recognition by addressing inter- and intra-domain shifts using dual-level and multi-view alignment techniques.
http://arxiv.org/abs/2407.05682v1
Compressor summary: The paper proposes a new framework, Retrieved In-Context Principles (RICP), which uses the teacher model to generate reasons and insights from student model mistakes to improve in-context learning for reasoning tasks.
http://arxiv.org/abs/2407.05680v1
Compressor summary: Our method reconstructs high-fidelity hand models with intricate textures using inverse rendering, Graph Convolutional Networks, and mesh-based neural rendering.
http://arxiv.org/abs/2407.05679v1
Compressor summary: BEVWorld is a novel approach that encodes multimodal sensor inputs into a unified latent space for environment modeling in autonomous driving, enabling future scenario prediction and improving downstream tasks.
http://arxiv.org/abs/2407.05674v1
Compressor summary: KITA is a programmable framework for creating task-oriented agents with reliable grounded responses and controllable policies, unlike brittle dialogue trees or LLMs prone to unfounded responses, and it outperforms GPT-4 on accuracy, dialogue act, and goal completion rate in a user study.
http://arxiv.org/abs/2407.05671v1
Compressor summary: The Multiscale Transformer model predicts missing values in motion forecasting for autonomous driving, improving accuracy and continuity by using multiscale attention and adaptive continuity representation.
http://arxiv.org/abs/2407.05666v1
Compressor summary: CP_NeRF improves NeRF's view rendering by adding depth and normal dense completion priors, using sparse data to guide ray sampling and construct a normal loss function for better training accuracy.
http://arxiv.org/abs/2407.05657v1
Compressor summary: Our novel approach for cross-domain few-shot action recognition uses a ResNet18 backbone and two branches for meta-training, integrating insights from labeled source data and unlabeled target data with domain encoders and dual distillation, improving generalization.
http://arxiv.org/abs/2407.05656v1
Compressor summary: The paper proposes using random circular vectors for XMC tasks, which improve performance and reduce computational cost.
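The core idea of random circular vectors is generic enough to sketch: each label is assigned a complex vector whose entries are random phases on the unit circle, a label set is stored as their superposition, and labels are recovered by similarity against the codebook. The dimensions, decoding rule, and variable names below are my own illustration, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_labels = 256, 1000

# Each label gets a random "circular" vector: complex entries of unit
# magnitude, i.e. uniformly random phases on the unit circle.
phases = rng.uniform(0, 2 * np.pi, size=(num_labels, dim))
label_vecs = np.exp(1j * phases)

# A set of active labels is represented compactly as a superposition.
active = [3, 41, 977]
bundle = label_vecs[active].sum(axis=0)

# Decode by (real-part) similarity against the codebook: active labels
# score near 1, inactive ones near 0, so the top scores recover the set.
scores = np.real(label_vecs.conj() @ bundle) / dim
top = np.argsort(scores)[-3:]
print(sorted(top.tolist()))
```

Because random phase vectors are nearly orthogonal in expectation, cross-talk between labels shrinks as the dimension grows, which is what lets a modest-dimensional code handle an extreme label space cheaply.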
http://arxiv.org/abs/2407.05650v1
Compressor summary: The Dynamic Net Architecture is a new intelligent-system architecture for vision that uses recurrence-stabilized networks to encode hierarchical feature representations, filter out irrelevant details, and generalize to unseen patterns.
http://arxiv.org/abs/2407.05649v1
Compressor summary: GRASS is a new graph neural network architecture that combines message passing, graph rewiring, and attention mechanisms to enhance information propagation and achieve state-of-the-art performance.
http://arxiv.org/abs/2407.05647v1
Compressor summary: MF-Adapter improves few-shot image classification by combining low-level and high-level features using Meta-Feature Units that measure local similarity.
http://arxiv.org/abs/2407.05645v1
Compressor summary: OneDiff is a novel generalist image difference captioning model that uses a robust vision-language architecture to accurately describe fine-grained variations between images and outperforms existing state-of-the-art models.
http://arxiv.org/abs/2407.05639v1
Compressor summary: The paper proposes a fusion model combining Isolation Forest, GAN, and Transformer to improve anomaly detection and log analysis in network security tasks.
http://arxiv.org/abs/2407.05638v1
Compressor summary: The novel HPFF model improves deep learning by combining hierarchical locally supervised learning and patch-level feature computation, achieving state-of-the-art performance on various image datasets.
http://arxiv.org/abs/2407.05633v1
Compressor summary: AdaPI is a novel approach for private inference on encrypted data that adapts to different edge devices' energy budgets and improves test accuracy by 7.3% on CIFAR-100.
http://arxiv.org/abs/2407.05627v1
Compressor summary: The paper discusses sentiment analysis with limited data for classifying positive, negative, or neutral opinions on a text dataset about Kaesang Pangarep's appointment, using F1-score as the metric.
http://arxiv.org/abs/2407.05623v1
Compressor summary: The paper proposes a Momentum Auxiliary Network (MAN) that improves information transfer between local blocks in deep neural networks, reducing GPU memory usage and increasing accuracy on image classification tasks.
http://arxiv.org/abs/2407.05622v1
Compressor summary: The paper studies how hard it is to learn sparse functions using gradient algorithms, introduces a new type of Statistical Queries called $\mathsf{DLQ}$ to model this process, and shows that the query complexity depends on the loss function used.
http://arxiv.org/abs/2407.05616v1
Compressor summary: ESCOUTER is a visually explainable classifier that provides transparent insights into deep learning models' decision-making process by incorporating explanations into the final confidence scores and offering positive or negative explanations for all categories.
http://arxiv.org/abs/2407.05615v1
Compressor summary: The paper proposes OSN, a framework that learns all plausible 3D scene configurations from monocular RGB videos using an object scale network and a joint optimization module.
http://arxiv.org/abs/2407.05611v1
Compressor summary: GenFollower is a novel approach that uses large language models to model car-following behaviors more accurately, interpreting factors influencing them, and improving traffic management and autonomous driving systems.
http://arxiv.org/abs/2407.05610v1
Compressor summary: This paper introduces a new task called described spatial-temporal video detection (DSTVD) that can handle multiple objects in language descriptions of videos, and presents a new benchmark dataset DVD-ST to evaluate it.
http://arxiv.org/abs/2407.05609v1
Compressor summary: The paper proposes a novel method called X-MLClass for open-world multi-label text classification under extremely weak supervision, which utilizes dominant keyphrases to discover and improve label space coverage and accuracy.
http://arxiv.org/abs/2407.05603v1
Compressor summary: WSI-VQA is a framework that uses generative visual question answering to interpret whole slide images for carcinoma diagnosis and prognosis, supporting tasks such as grading, prediction, and subtyping; its dataset contains 8672 slide-level question-answering pairs over 977 WSIs, on which the W2T model outperforms existing discriminative models in medical correctness.
http://arxiv.org/abs/2407.05600v1
Compressor summary: GenArtist is a unified image generation and editing system using a multimodal large language model agent to coordinate existing models, plan procedures, and perform various tasks efficiently and reliably.
http://arxiv.org/abs/2407.05599v1
Compressor summary: The study develops large language models that can automatically detect and correct climate misinformation using a "truth sandwich" structure and various prompting strategies.
http://arxiv.org/abs/2407.05597v1
Compressor summary: GeoNLF is a hybrid method that combines neural reconstruction and geometric pose optimization for LiDAR point cloud synthesis, improving performance on sparse-view inputs.
http://arxiv.org/abs/2407.05594v1
Compressor summary: SLIM reduces spurious correlations in deep learning with minimal human input by using attention labeling and feature balancing, improving model reliability and efficiency.
http://arxiv.org/abs/2407.05593v1
Compressor summary: UnmaskingTrees is a tool for generating tabular data and imputing missing values using tree models that gradually reveal features.
http://arxiv.org/abs/2407.05592v1
Compressor summary: The paper compares transfer learning and self-supervised learning in the medical field, evaluating their performance and robustness on small datasets with common medical issues.
http://arxiv.org/abs/2407.05591v1
Compressor summary: CAT is a new language model architecture that combines convolutional filters with attention, improving recall, copying, length generalization, and summarization tasks.
http://arxiv.org/abs/2407.05586v1
Compressor summary: D2RF is a method to restore sharp novel views from defocused monocular videos by modeling and removing depth-induced blur using layered Depth-of-Field volume rendering.
http://arxiv.org/abs/2407.05580v1
Compressor summary: E^2CFD is a framework that uses a large language model to generate cost functions for safe reinforcement learning, improving policy performance in various safety scenarios.
http://arxiv.org/abs/2407.05578v1
Compressor summary: FALIP improves CLIP's zero-shot performance in various tasks by adjusting its attention without altering the original image information.
http://arxiv.org/abs/2407.05576v1
Compressor summary: ORMNet is a novel end-to-end model for egocentric hand-object segmentation that leverages hand-guided attention and object relation decoupling to improve accuracy and reduce ambiguity.
http://arxiv.org/abs/2407.05577v1
Compressor summary: The paper proposes a method to edit talking face images with different emotions using audio-to-landmark and landmark-based editing modules, resulting in high-quality videos.
http://arxiv.org/abs/2407.05575v1
Compressor summary: The paper introduces a new benchmark dataset for reflective object detection with diverse images and annotations, revealing the limitations of existing methods in this area.
http://arxiv.org/abs/2407.05573v1
Compressor summary: The paper proposes a new method to predict human activities based on observed skeleton data using spatio-temporal encoding and decoding, outperforming some existing methods.
http://arxiv.org/abs/2407.05566v1
Compressor summary: The GMC framework is a general approach for multistage context learning and utilization in visual detection tasks that enhances performance and adaptability with user-defined configurations and diverse network architectures.
http://arxiv.org/abs/2407.05563v1
Compressor summary: LLMBox is a library that simplifies the development, use, and evaluation of large language models by providing a unified data interface, comprehensive evaluation, and user-friendly efficiency.
http://arxiv.org/abs/2407.05562v1
Compressor summary: The paper proposes a novel method to improve scene text recognition by enriching character features and addressing large intra-class variance and small inter-class variance issues.
http://arxiv.org/abs/2407.05557v1
Compressor summary: $R^2$-Guard is a robust reasoning-enabled guardrail model for large language models that uses knowledge-enhanced logical reasoning to capture intercorrelations among safety categories, achieving better performance and resilience than existing methods.
http://arxiv.org/abs/2407.05554v1
Compressor summary: PANS is a novel system that uses Monte-Carlo methods, depth-based motion inference, and bronchial semantic analysis to accurately and robustly localize bronchoscopes in real-time for pulmonary interventions.
http://arxiv.org/abs/2407.05553v1
Compressor summary: The paper proposes a method to predict skin color with foundation using two images and calibration with a color checker target.
http://arxiv.org/abs/2407.05552v1
Compressor summary: Ada-Adapter is a novel framework for few-shot style personalization of diffusion models that enables efficient zero-shot style transfer with limited source images and text prompts, achieving high-quality artistic stylizations.
http://arxiv.org/abs/2407.05551v1
Compressor summary: ReWaS is a novel video-and-text-to-sound generation method that uses video to control the synthesis of audio from text, allowing for flexible and high-quality sound production.
http://arxiv.org/abs/2407.05546v1
Compressor summary: The paper introduces Image Content Appeal Assessment (ICAA), a new metric that measures how appealing an image's content is to viewers, and shows its effectiveness in different domains.
http://arxiv.org/abs/2407.05547v1
Compressor summary: The paper proposes LaSe-E2V, a framework that uses language to guide event-to-video reconstruction, achieving semantic-aware high-quality results by combining event data with text-conditional diffusion models and an Event-guided Spatiotemporal Attention module.
http://arxiv.org/abs/2407.05540v1
Compressor summary: GTP-4o is a novel method that uses a graph-based approach to handle multi-modal information in biomedical domains, including completing missing modalities and aggregating cross-modal data.
http://arxiv.org/abs/2407.05538v1
Compressor summary: The paper presents translations between Normal Logic Programs and SETAFs, showing their semantic and structural equivalence, and RFALPs as an expressive subclass of NLPs.
http://arxiv.org/abs/2407.05528v1
Compressor summary: The paper explores unsupervised contrastive learning for detecting out-of-distribution samples in web-crawled data, but finds that it misses some clean examples and proposes a hybrid approach with a small-loss algorithm to improve image classification.
http://arxiv.org/abs/2407.05527v1
Compressor summary: The paper introduces the image squeeze connection, a new method that improves image synthesis quality and reduces parameters in StyleGAN models, by analyzing and addressing the limitations of the image skip connection technique.
http://arxiv.org/abs/2407.05526v1
Compressor summary: AI machines make decisions based on true facts and objective probabilities to achieve the best outcomes.