arxiv compressed, 2024-03-13

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-13, generated by the compressor, my personal LLM-based project.


Beyond Text: Frozen Large Language Models in Visual Signal Comprehension

http://arxiv.org/abs/2403.07874v1

Compressor summary: The authors propose a method to translate images into words using a large language model without fine-tuning, enabling image comprehension and denoising tasks.


Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

http://arxiv.org/abs/2403.07872v1

Compressor summary: The paper proposes an RWQ-Elo rating system for assessing large language models based on a two-player competitive format using real-world questions and demonstrates its advantages over MCQA.
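
As background on the rating mechanics, here is a minimal sketch of the classic pairwise Elo update that an RWQ-Elo-style system builds on; the K-factor and the win/tie scoring convention below are illustrative defaults, not the paper's exact protocol.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise Elo update; score_a is 1.0 if model A wins, 0.5 for a tie, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: a 1600-rated model beats a 1500-rated one on a real-world question.
print(elo_update(1600.0, 1500.0, 1.0))
```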


Exploring Safety Generalization Challenges of Large Language Models via Code

http://arxiv.org/abs/2403.07865v1

Compressor summary: The paper introduces CodeAttack, a framework that tests the safety generalization of LLMs by transforming natural language inputs into code inputs, revealing common vulnerabilities in all studied models.


Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

http://arxiv.org/abs/2403.07860v1

Compressor summary: The paper introduces LaVi-Bridge, a pipeline that integrates diverse language and generative vision models for text-to-image generation, improving alignment and quality.


Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias

http://arxiv.org/abs/2403.07857v1

Compressor summary: The text discusses how model-induced distribution shifts can cause performance and fairness issues in machine learning models, but also proposes a framework called algorithmic reparation to address these problems and promote equity.


Quantum Support Vector Machine for Prostate Cancer Detection: A Performance Analysis

http://arxiv.org/abs/2403.07856v1

Compressor summary: The study uses Quantum Support Vector Machine (QSVM) to improve prostate cancer detection, achieving comparable accuracy and increased sensitivity over classical SVM.


Distilling the Knowledge in Data Pruning

http://arxiv.org/abs/2403.07854v1

Compressor summary:
Key points:
- Data pruning can reduce model size and training time but may compromise accuracy
- Knowledge distillation (KD) integrates soft predictions from a teacher network pre-trained on full data to guide the pruned student network
- KD improves pruned models across datasets, pruning methods, and pruning fractions
- There is a trade-off between the pruning factor and the optimal knowledge distillation weight
- Smaller teachers may outperform larger ones for lower pruning fractions
Summary: The paper shows that using knowledge distillation with data pruning can improve accuracy and suggests optimal parameters for different pruning regimes.
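
For readers unfamiliar with the mechanics, here is a minimal PyTorch sketch of the standard Hinton-style distillation objective such a setup builds on; the temperature T and mixing weight alpha are the knobs whose interaction with the pruning fraction the paper studies (the values below are illustrative).

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy mixed with temperature-softened teacher targets."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-target gradients keep their magnitude
    return alpha * hard + (1.0 - alpha) * soft
```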


12 mJ per Class On-Device Online Few-Shot Class-Incremental Learning

http://arxiv.org/abs/2403.07851v1

Compressor summary: O-FSCIL is a lightweight, memory-efficient method for incrementally learning new classes from few examples on resource-constrained devices.


Iterative Graph Neural Network Enhancement via Frequent Subgraph Mining of Explanations

http://arxiv.org/abs/2403.07849v1

Compressor summary: EEGL is an iterative algorithm that improves GNNs for node classification by using explanations from subgraph mining to obtain application-dependent features.


A Machine learning and Empirical Bayesian Approach for Predictive Buying in B2B E-commerce

http://arxiv.org/abs/2403.07843v1

Compressor summary: Udaan, India's largest B2B e-commerce platform, uses an ensemble of machine learning models to predict and optimize customer order patterns, resulting in a significant increase in order rates.


Quantifying and Mitigating Privacy Risks for Tabular Generative Models

http://arxiv.org/abs/2403.07842v1

Compressor summary: DP-TLDM is a novel tabular synthesizer that combines an autoencoder with a latent diffusion model to generate high-quality, privacy-preserving synthetic data.


MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric

http://arxiv.org/abs/2403.07839v1

Compressor summary:
Key points:
- The paper proposes a new pruning framework (MoPE-CLIP) for vision-language pre-trained models
- The MoPE metric assesses module importance by performance decline on cross-modal tasks
- MoPE-CLIP reduces pre-training costs and achieves competitive task-specific performance
Summary: The paper introduces a new pruning method (MoPE-CLIP) that uses a novel metric to measure the importance of modules in vision-language pre-trained models, leading to reduced pre-training costs and high task-specific performance.


The Missing Piece in Model Editing: A Deep Dive into the Hidden Damage Brought By Model Editing

http://arxiv.org/abs/2403.07825v1

Compressor summary: The paper proposes GORA, a new method to evaluate and measure the ripple effect in large language models, and SORA, a technique to mitigate this effect by selectively re-editing the model.


Label Dropout: Improved Deep Learning Echocardiography Segmentation Using Multiple Datasets With Domain Shift and Partial Labelling

http://arxiv.org/abs/2403.07818v1

Compressor summary: The paper proposes a new label dropout technique to improve the robustness of deep learning models for echocardiography segmentation when trained with multiple diverse partially-labelled datasets.


Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

http://arxiv.org/abs/2403.07816v1

Compressor summary: Branch-Train-MiX (BTX) is a method for training large language models with multiple specialized skills by branching a seed model, training experts in parallel, combining their feedforward parameters, and learning token-level routing.
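
To make the token-level routing concrete, here is a generic top-1 Mixture-of-Experts feedforward layer; this is a sketch of the general technique, not BTX's exact architecture (BTX additionally initializes each expert FFN from a separately branch-trained model).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenRoutedFFN(nn.Module):
    """Top-1 token-level routing over expert feedforward blocks."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)   # (n_tokens, n_experts)
        top_w, top_i = weights.max(dim=-1)            # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out
```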


Chronos: Learning the Language of Time Series

http://arxiv.org/abs/2403.07815v1

Compressor summary: Chronos is a framework for pretraining probabilistic time series models that use transformer-based language models and tokenized time series data, achieving high zero-shot performance on various forecasting tasks.
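
The core trick is making a continuous series look like text to a language model; a rough sketch of mean-scaling plus uniform binning is below. The bin count, value range, and exact scaling rule are assumptions for illustration, not Chronos's published configuration.

```python
import numpy as np

def tokenize_series(values, n_bins=4096, low=-15.0, high=15.0):
    """Mean-scale a series, then quantize values into integer 'vocabulary' tokens."""
    x = np.asarray(values, dtype=float)
    scale = np.mean(np.abs(x)) or 1.0         # guard against an all-zero series
    edges = np.linspace(low, high, n_bins - 1)
    tokens = np.digitize(x / scale, edges)    # integers in [0, n_bins - 1]
    return tokens, scale

tokens, scale = tokenize_series([112.0, 118.0, 121.0, 119.5])
```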


pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

http://arxiv.org/abs/2403.07809v1

Compressor summary: pyvene is a Python library that allows customizable interventions on different PyTorch modules for various AI applications, including interpretability and robustness.


StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting

http://arxiv.org/abs/2403.07807v1

Compressor summary: StyleGaussian is a fast 3D style transfer method that uses 3D Gaussian Splatting, VGG features, and a K-nearest-neighbor-based 3D CNN to achieve high quality stylization with real-time rendering and multi-view consistency.


Beyond Memorization: The Challenge of Random Memory Access in Language Models

http://arxiv.org/abs/2403.07805v1

Compressor summary:
Key points:
- LMs are effective in NLP tasks, especially knowledge-intensive ones
- The paper investigates how LMs access their memory sequentially or randomly
- LMs can access memory sequentially but struggle with random access
- Recitation and permutation techniques improve random access
- Improved random access helps question answering
Summary: The paper explores how LMs access their memory in different scenarios, and proposes recitation and permutation to enhance their random access, which benefits question answering.


A Fourier Transform Framework for Domain Adaptation

http://arxiv.org/abs/2403.07798v1

Compressor summary: The text proposes a Fourier Transform-based Framework (FTF) to improve unsupervised domain adaptation by fusing low-level information from both domains and aligning multiple sources of data, achieving better generalization and performance on four benchmark datasets.


Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data

http://arxiv.org/abs/2403.07797v1

Compressor summary: Jam-pgm is a new technique that allows graphical model-based synthetic data generation to use public data, improving quality and outperforming other methods, even with biased public data distributions.


Fine-tuning Large Language Models with Sequential Instructions

http://arxiv.org/abs/2403.07794v1

Compressor summary: The paper proposes sequential instruction tuning to improve large language models' ability to follow multiple instructions in complex tasks, and analyzes its effects on various factors.


SemCity: Semantic Scene Generation with Triplane Diffusion

http://arxiv.org/abs/2403.07773v1

Compressor summary: SemCity generates realistic outdoor scenes using a 3D diffusion model with triplane representation and manipulation, improving performance on tasks like inpainting and semantic completion.


Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern Organizations

http://arxiv.org/abs/2403.07769v1

Compressor summary: The article discusses using large language models and multi-agent systems theory to create artificial agents that can simulate human interactions and support various organizational processes, overcoming some limitations of traditional approaches.


Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model

http://arxiv.org/abs/2403.07764v1

Compressor summary: Stable-Makeup is a novel diffusion-based method that transfers realistic and detailed makeup to user-provided faces, preserving content and structure, and showing strong robustness and generalizability for various tasks.


Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

http://arxiv.org/abs/2403.07750v1

Compressor summary: Synth$^2$ synthesizes image-text pairs using LLMs and image generation models, improving VLM training efficiency and performance on image captioning tasks.


FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

http://arxiv.org/abs/2403.07747v1

Compressor summary: FineMath is a new benchmark dataset for evaluating Chinese LLMs' mathematical reasoning skills on diverse elementary school math problems with different difficulty levels.


Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception

http://arxiv.org/abs/2403.07746v1

Compressor summary: HyDRa is a novel camera-radar fusion architecture that improves depth prediction and 3D perception for autonomous driving in diverse conditions, achieving state-of-the-art results on nuScenes and Occ3D benchmarks.


Uncertainty Quantification with Deep Ensembles for 6D Object Pose Estimation

http://arxiv.org/abs/2403.07741v1

Compressor summary: The paper proposes a method to quantify uncertainty in multi-stage 6D object pose estimation using deep ensembles and evaluates it on SurfEmb, a top-performing approach.
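
The deep-ensembles recipe itself is simple to state: train several independently initialized models and read predictive uncertainty off the spread of their outputs. A generic sketch follows; how the paper plugs this into SurfEmb's multi-stage pipeline (and how rotations are properly averaged) is not reproduced here.

```python
import numpy as np

def ensemble_mean_and_spread(member_outputs):
    """Aggregate per-member predictions into a mean and a dispersion-based
    uncertainty proxy (naive vector averaging; fine for a sketch only)."""
    preds = np.stack(member_outputs)   # shape: (n_members, ...)
    return preds.mean(axis=0), preds.std(axis=0)

# Five hypothetical 6-DoF pose vectors from five ensemble members:
mean_pose, pose_spread = ensemble_mean_and_spread(
    [np.random.randn(6) for _ in range(5)]
)
```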


DSEG-LIME -- Improving Image Explanation by Hierarchical Data-Driven Segmentation

http://arxiv.org/abs/2403.07733v1

Compressor summary: DSEG-LIME is a new method to improve image analysis by creating more accurate and consistent explanations of complex machine learning models using data-driven segmentation.


SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

http://arxiv.org/abs/2403.07726v1

Compressor summary: The paper describes SHROOM, a shared task on detecting inaccurate NLG outputs across three tasks, and reports trends and results from the 42 participating teams.


Balancing Fairness and Accuracy in Data-Restricted Binary Classification

http://arxiv.org/abs/2403.07724v1

Compressor summary: The paper presents a framework to analyze how data restrictions affect the accuracy and fairness of Bayesian classifiers under various scenarios and fairness definitions.


On the Last-Iterate Convergence of Shuffling Gradient Methods

http://arxiv.org/abs/2403.07723v1

Compressor summary: Shuffling gradient methods, such as Random Reshuffle and Shuffle Once, have good empirical performance but lacked theoretical guarantees until now; researchers prove last-iterate convergence rates without strong convexity.
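
For concreteness, the two schemes differ only in when the permutation is drawn; a minimal sketch, where `step` performs one gradient update on a single example:

```python
import random

def random_reshuffle(data, epochs, step):
    """Random Reshuffle: draw a fresh permutation at every epoch."""
    for _ in range(epochs):
        order = random.sample(range(len(data)), len(data))
        for i in order:
            step(data[i])

def shuffle_once(data, epochs, step):
    """Shuffle Once: draw a single permutation and reuse it every epoch."""
    order = random.sample(range(len(data)), len(data))
    for _ in range(epochs):
        for i in order:
            step(data[i])
```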


Multi-modal Auto-regressive Modeling via Visual Words

http://arxiv.org/abs/2403.07720v1

Compressor summary:
Key points:
- The paper introduces Large Multi-modal Models (LMMs) that combine text and image features for various vision tasks
- The paper proposes visual words, which map visual features to text vocabulary, providing supervision information
- The paper experiments with 5 VQA tasks and shows the effectiveness of the proposed approach
Summary: The paper presents a novel LMM that uses visual words to supervise image modelling and achieves state-of-the-art results on 5 VQA tasks.


Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

http://arxiv.org/abs/2403.07719v1

Compressor summary: The authors propose a novel dynamic graph representation algorithm for histopathology whole slide image classification that captures both instance relationships and spatial interactions using a knowledge-aware attention mechanism.


WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

http://arxiv.org/abs/2403.07718v1

Compressor summary: The study measures large language models' abilities to interact with enterprise software using WorkArena benchmark and BrowserGym environment, finding promise but significant gaps and disparities.


StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

http://arxiv.org/abs/2403.07714v1

Compressor summary: StableToolBench is a benchmark for testing LLMs with external tools that uses a virtual API server, a caching system, and an automatic evaluator to ensure stability and fairness.


SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces

http://arxiv.org/abs/2403.07711v1

Compressor summary:
Key points:
- Diffusion models for video generation use attention layers but have memory limitations
- State-space models (SSMs) are proposed as alternatives with linear memory consumption
- SSMs achieve competitive results on UCF101 and MineRL Navigate datasets
Summary: The paper proposes using state-space models for video generation instead of attention layers, which save memory and maintain performance.


Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

http://arxiv.org/abs/2403.07708v1

Compressor summary: The authors propose a penalty term called contrastive rewards to make reward models more effective in reinforcement learning from human feedback, which improves robustness, calibration, and performance.


Fast and Simple Explainability for Point Cloud Networks

http://arxiv.org/abs/2403.07706v1

Compressor summary: FBI is a fast and simple XAI method for point cloud data that enables better understanding of the network properties, online feedback, and improved classification explainability.


Robust Synthetic-to-Real Transfer for Stereo Matching

http://arxiv.org/abs/2403.07705v1

Compressor summary: The paper proposes a method to fine-tune stereo matching networks without losing their robustness to unseen domains by using pseudo labels and a dynamic framework.


Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning

http://arxiv.org/abs/2403.07704v1

Compressor summary: Symmetric Q-learning improves deep reinforcement learning by creating a Gaussian error distribution from skewed noise, increasing sample efficiency on continuous control tasks in MuJoCo.


CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers

http://arxiv.org/abs/2403.07700v1

Compressor summary: VoteCut is a new method that uses multiple self-supervised models to discover objects without labels and improves image segmentation with CuVLER, a zero-shot model trained on pseudo-labels.


Large, Small or Both: A Novel Data Augmentation Framework Based on Language Models for Debiasing Opinion Summarization

http://arxiv.org/abs/2403.07693v1

Compressor summary: The paper proposes a data augmentation framework for opinion summarization that uses both large and small language models to generate synthetic negative reviews and balance the sentiment distribution of the dataset.


Masked AutoDecoder is Effective Multi-Task Vision Generalist

http://arxiv.org/abs/2403.07692v1

Compressor summary: The paper proposes Masked AutoDecoder (MAD), a multi-task vision generalist that uses bi-directional attention and masked sequence modeling to unify different vision tasks in parallel, achieving better performance and efficiency than autoregressive models.


Reference-free Monolithic Preference Optimization with Odds Ratio

http://arxiv.org/abs/2403.07691v1

Compressor summary: The paper introduces ORPO, a reference model-free algorithm that improves language models by fine-tuning them with odds ratios, achieving better performance than state-of-the-art models.
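
The odds-ratio idea can be sketched directly: build a preference term from the log odds of the chosen versus the rejected response, and add it (with a small weight) to the ordinary SFT loss. The snippet below is a numerical sketch of that term, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def odds_ratio_term(logp_chosen, logp_rejected):
    """-log sigmoid of the log odds ratio between chosen and rejected responses;
    logp_* are (length-normalized) sequence log-probabilities under the model."""
    def log_odds(logp):
        # odds(y|x) = P / (1 - P); the clamp keeps log1p(-exp(.)) finite as P -> 1
        logp = logp.clamp(max=-1e-6)
        return logp - torch.log1p(-torch.exp(logp))
    return -F.logsigmoid(log_odds(logp_chosen) - log_odds(logp_rejected))

term = odds_ratio_term(torch.tensor([-0.8]), torch.tensor([-1.6]))
```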


Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

http://arxiv.org/abs/2403.07688v1

Compressor summary: The paper reassesses dying neurons in deep neural networks, showing that they can be useful for structured pruning and compression, and introduces Demon Pruning, a simple and effective method to control them.


Annotations on a Budget: Leveraging Geo-Data Similarity to Balance Model Performance and Annotation Cost

http://arxiv.org/abs/2403.07687v1

Compressor summary:
Key points:
- Current foundation models have imbalanced geographical and economic representation in their training data
- More data from underrepresented countries is needed to improve model performance and reduce annotation costs
- The paper proposes methods to identify the data to be annotated based on visual distinctiveness and similarity
- The resulting lists of countries and topics are available online
Summary: The paper presents methods to balance model performance and annotation costs by identifying and annotating data from underrepresented countries with visually distinctive and similar topics to current foundation models' training data.


Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal

http://arxiv.org/abs/2403.07684v1

Compressor summary: The paper proposes a method for removing adverse weather conditions from videos using test-time adaptation and a diffusion-based network, which improves generalization to unseen weather conditions.


MoralBERT: Detecting Moral Values in Social Discourse

http://arxiv.org/abs/2403.07678v1

Compressor summary: MoralBERT models capture moral nuances in text using annotated data from Twitter, Reddit, and Facebook, improving prediction accuracy compared to traditional methods.


Machine Learning for Soccer Match Result Prediction

http://arxiv.org/abs/2403.07669v1

Compressor summary: This chapter reviews machine learning methods for soccer match outcome prediction, highlighting the current best-performing models, the lack of direct comparisons between deep learning and Random Forest approaches, and potential improvements in features and interpretability.


Scalable Spatiotemporal Prediction with Bayesian Neural Fields

http://arxiv.org/abs/2403.07657v1

Compressor summary: Bayesian Neural Field (BayesNF) is a statistical model combining a deep neural network and Bayesian inference for spatiotemporal data analysis, outperforming existing methods on large-scale climate and public health datasets.


Harder Tasks Need More Experts: Dynamic Routing in MoE Models

http://arxiv.org/abs/2403.07652v1

Compressor summary: The paper proposes a dynamic expert selection method for Mixture of Experts models that adjusts the number of activated experts based on input difficulty, improving efficiency and performance.
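
One plausible reading of the mechanism is threshold-based ("top-p") expert selection: activate the smallest set of experts whose cumulative routing probability reaches a threshold, so harder inputs with flatter routing distributions get more experts. The exact criterion and normalization in the paper are assumptions here.

```python
import torch
import torch.nn.functional as F

def dynamic_expert_select(router_logits, p=0.5):
    """Per token, keep the smallest set of experts (most probable first)
    whose cumulative routing probability reaches p."""
    probs = F.softmax(router_logits, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    mass_before = sorted_p.cumsum(dim=-1) - sorted_p  # mass accumulated before each expert
    keep = mass_before < p                            # the top expert is always kept
    return [idx[t][keep[t]] for t in range(router_logits.shape[0])]

active = dynamic_expert_select(torch.randn(4, 8), p=0.5)  # 4 tokens, 8 experts
```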


Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Matching Framework

http://arxiv.org/abs/2403.07636v1

Compressor summary: The paper proposes a novel VLP framework that dissects disease descriptions into aspects, aligns images with them, and improves detection of known and unknown diseases using a dual-head Transformer.


CardioGenAI: A Machine Learning-Based Framework for Re-Engineering Drugs for Reduced hERG Liability

http://arxiv.org/abs/2403.07632v1

Compressor summary: The paper introduces CardioGenAI, a machine learning framework that can re-engineer drugs to reduce their potential to cause heart problems by interfering with a specific ion channel, while maintaining their effectiveness.


Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation

http://arxiv.org/abs/2403.07630v1

Compressor summary: This paper proposes a method called CPAL to improve semantic segmentation by using context-aware prototypes that capture diverse object features and reduce knowledge bias between instances and contexts.


Multiple Latent Space Mapping for Compressed Dark Image Enhancement

http://arxiv.org/abs/2403.07622v1

Compressor summary: The study proposes a novel latent mapping network based on variational auto-encoder (VAE) to enhance compressed dark images while preserving texture details and avoiding compression artifacts amplification.


Smartphone region-wise image indoor localization using deep learning for indoor tourist attraction

http://arxiv.org/abs/2403.07621v1

Compressor summary: The paper proposes using deep learning to classify locations in smart museums and aquariums with smartphone images, achieving high precision and showing good feasibility for indoor tourism attractions.


Efficient Knowledge Deletion from Trained Models through Layer-wise Partial Machine Unlearning

http://arxiv.org/abs/2403.07611v1

Compressor summary: The paper introduces new machine unlearning algorithms that selectively erase knowledge from trained models while preserving performance and avoiding the need for subsequent fine-tuning.


Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation

http://arxiv.org/abs/2403.07605v1

Compressor summary: NegOpt is a novel method that optimizes negative prompt generation for text-to-image models using supervised fine-tuning and reinforcement learning, improving image quality by 25% on average.


ProPML: Probability Partial Multi-label Learning

http://arxiv.org/abs/2403.07603v1

Compressor summary: ProPML is a new probabilistic method for partial multi-label learning (PML) that improves performance over existing methods, especially in noisy environments.


Unified Source-Free Domain Adaptation

http://arxiv.org/abs/2403.07601v1

Compressor summary: The paper proposes a novel approach called LCFD that discovers causal relationships between latent variables and model decisions for unified source-free domain adaptation, achieving state-of-the-art results.


Mondrian: On-Device High-Performance Video Analytics with Compressive Packed Inference

http://arxiv.org/abs/2403.07598v1

Compressor summary: Mondrian is a system that improves object detection on high-resolution videos by selectively processing relevant pixels and combining them efficiently on accelerators like GPUs, achieving higher accuracy and throughput than existing methods.


MinkUNeXt: Point Cloud-based Large-scale Place Recognition using 3D Sparse Convolutions

http://arxiv.org/abs/2403.07593v1

Compressor summary: MinkUNeXt is an efficient 3D place-recognition architecture based on 3D sparse convolutions that surpasses current methods without resorting to more complex mechanisms such as Transformers or attention layers.


Accurate Spatial Gene Expression Prediction by integrating Multi-resolution features

http://arxiv.org/abs/2403.07592v1

Compressor summary: TRIPLEX is a deep learning framework that predicts spatial gene expression from images, improving on current models and aiding in cancer diagnosis and treatment.


Robustifying and Boosting Training-Free Neural Architecture Search

http://arxiv.org/abs/2403.07591v1

Compressor summary: The text proposes a new algorithm (RoBoT) that combines and optimizes existing training-free metrics using Bayesian optimization to improve the search performance of neural network design, especially for diverse tasks.


PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution

http://arxiv.org/abs/2403.07589v1

Compressor summary:
Key points:
- Large kernel convnets have appealing performance but face challenges due to the square complexity of convolution and proliferated parameters
- The paper proposes peripheral convolution, inspired by human vision, which reduces the parameter count and complexity of convolution
- The paper also introduces PeLK, a large kernel network that outperforms modern vision Transformers and ConvNet architectures on various tasks
Summary: The paper presents peripheral convolution, a novel CNN method based on human vision, and PeLK, a large kernel network that achieves superior performance on vision tasks with extremely large kernels.


Visual Privacy Auditing with Diffusion Models

http://arxiv.org/abs/2403.07588v1

Compressor summary: The paper studies how image reconstruction attacks on machine learning models depend on real-world image priors and suggests using diffusion models to assess privacy risks under differential privacy.


Perennial Semantic Data Terms of Use for Decentralized Web

http://arxiv.org/abs/2403.07587v1

Compressor summary: The authors propose a novel system that allows users to define and automate their data terms of use policies for decentralized web applications like Solid, ensuring better control over their data and improving privacy and usability.


LLM vs Small Model? Large Language Model Based Text Augmentation Enhanced Personality Detection Model

http://arxiv.org/abs/2403.07581v1

Compressor summary: The paper proposes a method to detect personality traits in social media posts using a large language model, text augmentations, and contrastive learning, achieving better results than existing methods.


AACP: Aesthetics assessment of children's paintings based on self-supervised learning

http://arxiv.org/abs/2403.07578v1

Compressor summary: The paper proposes a self-supervised learning model for assessing children's paintings aesthetics, using a novel dataset with labeled attributes and outperforming other methods.


FPT: Fine-grained Prompt Tuning for Parameter and Memory Efficient Fine Tuning in High-resolution Medical Image Classification

http://arxiv.org/abs/2403.07576v1

Compressor summary: Fine-grained Prompt Tuning (FPT) is a novel method for medical image classification that reduces memory consumption by using a lightweight side network and fine-grained prompts to access pre-trained knowledge from large-scale models.


An Active Contour Model Driven By the Hybrid Signed Pressure Function

http://arxiv.org/abs/2403.07570v1

Compressor summary: The paper presents an active contour model with a hybrid signed pressure function that combines global and local information to improve image segmentation in complex environments.


Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

http://arxiv.org/abs/2403.07567v1

Compressor summary: The paper presents T2X, a new data-to-text dataset for isiXhosa, introduces the SSPG model for agglutinative languages, and evaluates various methods for generating text from data.


An Improved Strategy for Blood Glucose Control Using Multi-Step Deep Reinforcement Learning

http://arxiv.org/abs/2403.07566v1

Compressor summary: This paper proposes a novel DRL algorithm for blood glucose control that uses multi-step learning and Prioritized Experience Replay, achieving better results than benchmarks in time-in-range.


Unleashing Network Potentials for Semantic Scene Completion

http://arxiv.org/abs/2403.07560v1

Compressor summary: AMMNet is a novel framework for semantic scene completion that uses cross-modal modulation and adversarial training to improve feature learning and generalization from single-view RGB-D images.


SIFiD: Reassess Summary Factual Inconsistency Detection with LLM

http://arxiv.org/abs/2403.07557v1

Compressor summary: The study compares GPT-3.5 and GPT-4 for detecting inconsistencies in summaries and proposes SIFiD, a method to identify key sentences for inconsistency detection using LLMs.


Truth-Aware Context Selection: Mitigating the Hallucinations of Large Language Models Being Misled by Untruthful Contexts

http://arxiv.org/abs/2403.07556v1

Compressor summary: TACS is a method to help large language models generate better text by filtering out untruthful information from the input context.


Online Continual Learning For Interactive Instruction Following Agents

http://arxiv.org/abs/2403.07548v1

Compressor summary: The text proposes two realistic scenarios for learning embodied agents, and introduces the Confidence-Aware Moving Average (CAMA) method to update logits without task boundary information.


SMURF: Continuous Dynamics for Motion-Deblurring Radiance Fields

http://arxiv.org/abs/2403.07547v1

Compressor summary: SMURF is a novel method that uses Neural-ODEs to model continuous camera motion and volumetric representation for synthesizing high-fidelity views with robustness to motion blur.


MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki

http://arxiv.org/abs/2403.07544v1

Compressor summary: The MAMMOTH toolkit is a framework for training modular multilingual machine translation systems efficiently across clusters of GPUs and is publicly available online.


A Survey of Vision Transformers in Autonomous Driving: Current Trends and Future Directions

http://arxiv.org/abs/2403.07542v1

Compressor summary: The text summarizes how visual transformer models, successful in natural language processing, are being adapted for autonomous driving tasks, such as object detection and scene recognition, due to their advantages in processing dynamic visual scenes.


LaB-GATr: geometric algebra transformers for large biomedical surface and volume meshes

http://arxiv.org/abs/2403.07536v1

Compressor summary: LaB-GATr is a transformer neural network that can effectively learn from large-scale medical 3D models by using geometric tokenisation, sequence compression and interpolation, achieving state-of-the-art results in cardiovascular hemodynamics modelling and neurodevelopmental phenotype prediction.


Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving

http://arxiv.org/abs/2403.07535v1

Compressor summary: The paper proposes a new fused depth estimation system that adaptively integrates multi-view and single-view results for robustness against noisy camera poses and challenging conditions.


Open-World Semantic Segmentation Including Class Similarity

http://arxiv.org/abs/2403.07532v1

Compressor summary: The paper presents a method for autonomous systems to identify and classify novel objects in real-world images without extra training data, enabling better decision-making in tasks like planning or mapping.


Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss

http://arxiv.org/abs/2403.07518v1

Compressor summary: The paper proposes Pseudo-OCR, an open-vocabulary text recognition framework that uses character detection and image inpainting to generate pseudo OOV training data from real images and a quality-aware margin loss to train with both real and pseudo data.


D4D: An RGBD diffusion model to boost monocular depth estimation

http://arxiv.org/abs/2403.07516v1

Compressor summary: The paper proposes a new method to generate realistic RGBD samples using D4D, an RGBD diffusion model, which improves deep learning models' performance on monocular depth estimation tasks.


Uncertainty-guided Contrastive Learning for Single Source Domain Generalisation

http://arxiv.org/abs/2403.07514v1

Compressor summary: CUDGNet is a novel model that uses contrastive learning and domain generation to improve single domain generalization and provide uncertainty estimation.


Spatiotemporal Representation Learning for Short and Long Medical Image Time Series

http://arxiv.org/abs/2403.07513v1

Compressor summary: The paper proposes two methods for analyzing temporal patterns in medical data using deep learning, improving prognosis and diagnosis of conditions like AMD and cardiac output.


Relevance Score: A Landmark-Like Heuristic for Planning

http://arxiv.org/abs/2403.07510v1

Compressor summary: The paper proposes a novel "relevance score" for heuristic planning that identifies actions or facts important for most but not all plans to achieve a goal, and shows its improved performance compared to the standard landmark-based approach on problems without clear landmarks.


MoAI: Mixture of All Intelligence for Large Language and Vision Models

http://arxiv.org/abs/2403.07508v1

Compressor summary: MoAI is a new large language and vision model that uses auxiliary computer vision information for better real-world scene understanding in zero-shot tasks.


Constrained Optimal Fuel Consumption of HEV: A Constrained Reinforcement Learning Approach

http://arxiv.org/abs/2403.07503v1

Compressor summary: The paper proposes a mathematical expression for constrained optimal fuel consumption in hybrid electric vehicles using constrained reinforcement learning and compares two mainstream approaches, finding that Lagrangian-based methods achieve lower fuel consumption with more oscillations than variational policy optimization.
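
The Lagrangian-based family mentioned here follows a standard primal-dual pattern: the policy maximizes a reward shaped by a multiplier, while the multiplier performs dual ascent on the constraint violation. A generic sketch with illustrative hyperparameters:

```python
def lagrangian_step(reward, cost, cost_limit, lam, lam_lr=0.01):
    """One step of the generic primal-dual scheme: the policy optimizes the
    shaped objective, and lam rises whenever the constraint is violated."""
    shaped_objective = reward - lam * cost                 # what the policy maximizes
    lam = max(0.0, lam + lam_lr * (cost - cost_limit))     # project to lam >= 0
    return shaped_objective, lam
```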


Detecting Security-Relevant Methods using Multi-label Machine Learning

http://arxiv.org/abs/2403.07501v1

Compressor summary: Dev-Assist is an IntelliJ IDEA plugin that uses multi-label machine learning to detect security-relevant methods in code and automatically configure static analysis tools with better performance than related approaches.


Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image Generation

http://arxiv.org/abs/2403.07500v1

Compressor summary: Block-wise LoRA is a technique that improves personalization and stylization in text-to-image generation by fine-tuning different blocks of a diffusion model.


Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

http://arxiv.org/abs/2403.07487v1

Compressor summary: Motion Mamba is a novel approach that combines Hierarchical Temporal Mamba and Bidirectional Spatial Mamba blocks to efficiently generate high-quality human motion sequences using state space models.


XpertAI: uncovering model strategies for sub-manifolds

http://arxiv.org/abs/2403.07486v1

Compressor summary: XpertAI is a framework that helps explain regression models by breaking them into range-specific sub-strategies and allowing precise queries as linear combinations of those sub-strategies.


A Deep Learning Approach to Diabetes Diagnosis

http://arxiv.org/abs/2403.07483v1

Compressor summary: The paper proposes a non-invasive diabetes diagnosis method using a neural network with batch normalization, data re-sampling and balancing, which improves accuracy compared to traditional machine learning methods.


Imbalance-aware Presence-only Loss Function for Species Distribution Modeling

http://arxiv.org/abs/2403.07472v1

Compressor summary: The study shows that using deep learning models with a balanced presence-only loss function can better model rare species in citizen science datasets, helping with biodiversity conservation.


A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes

http://arxiv.org/abs/2403.07469v1

Compressor summary:
Key points:
- 3D dense captioning is a vision-language bridging task that generates detailed descriptions for 3D scenes
- The paper reviews existing methods, provides a standard pipeline, introduces a taxonomy, and proposes future directions
- The aim is to facilitate research and applications in multimedia and related domains
Summary: The paper surveys 3D dense captioning, a task that creates accurate descriptions of 3D scenes, and presents a comprehensive review of methods, tools, challenges, and opportunities.


Experimental Comparison of Ensemble Methods and Time-to-Event Analysis Models Through Integrated Brier Score and Concordance Index

http://arxiv.org/abs/2403.07460v1

Compressor summary: The paper compares various prediction models for time-to-event analysis, evaluates their performance on three datasets, and explores how ensemble methods can improve accuracy and robustness.
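
Of the two metrics, the Brier score is the easier to state: the mean squared error between predicted event probabilities and observed 0/1 outcomes, with lower being better; the integrated version used in the paper averages it over evaluation times with censoring weights. A minimal sketch of the binary case:

```python
import numpy as np

def brier_score(event_prob, event_observed):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p = np.asarray(event_prob, dtype=float)
    y = np.asarray(event_observed, dtype=float)
    return float(np.mean((p - y) ** 2))

print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```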


A tutorial on multi-view autoencoders using the multi-view-AE library

http://arxiv.org/abs/2403.07456v1

Compressor summary: The authors propose a unified mathematical framework for multi-view autoencoders and extend the multi-view-AE Python library to provide consistent implementations and improve performance for modelling multi-modal data.


Proxy Methods for Domain Adaptation

http://arxiv.org/abs/2403.07442v1

Compressor summary: The paper proposes a method for domain adaptation under distribution shift using proximal causal learning and proxy variables, and shows its effectiveness in two settings: Concept Bottleneck and Multi-domain.


Matrix-Transformation Based Low-Rank Adaptation (MTLoRA): A Brain-Inspired Method for Parameter-Efficient Fine-Tuning

http://arxiv.org/abs/2403.07440v1

Compressor summary: MTLoRA is a new matrix transformation-based method for efficient fine-tuning of LPLMs that mimics brain function geometry to improve performance on NLU, NLG, and other downstream tasks.


Category-Agnostic Pose Estimation for Point Clouds

http://arxiv.org/abs/2403.07437v1

Compressor summary: The paper proposes a geometric feature method for object pose estimation that works without category information, achieving similar results to category-based methods.


JSTR: Joint Spatio-Temporal Reasoning for Event-based Moving Object Detection

http://arxiv.org/abs/2403.07436v1

Compressor summary: The paper proposes a new method for event-based moving object detection that uses joint spatio-temporal reasoning and improves accuracy by 13%.


Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow

http://arxiv.org/abs/2403.07432v1

Compressor summary: The text proposes a novel framework that uses an event as a bridge between RGB and LiDAR sensors for fusing cross-modal knowledge in scene flow, improving visual and motion features.


DragAnything: Motion Control for Anything using Entity Representation

http://arxiv.org/abs/2403.07420v1

Compressor summary: DragAnything is a user-friendly motion control method for any object in video generation that uses entity representation and trajectory-based interaction.


Learning-Augmented Algorithms with Explicit Predictors

http://arxiv.org/abs/2403.07413v1

Compressor summary: The paper proposes online learning algorithms tailored for specific tasks like caching and scheduling, improving performance and robustness by integrating machine learning prediction into the algorithm design.


NightHaze: Nighttime Image Dehazing via Self-Prior Learning

http://arxiv.org/abs/2403.07408v1

Compressor summary: The paper proposes a novel nighttime image dehazing method using severe augmentation during training, which improves robustness to real-world degradations and achieves state-of-the-art performance.


In-context learning enables multimodal large language models to classify cancer pathology images

http://arxiv.org/abs/2403.07407v1

Compressor summary: In-context learning allows GPT-4V, a large vision language model, to perform well on three cancer histopathology tasks with minimal data and no task-specific fine-tuning.


FeTrIL++: Feature Translation for Exemplar-Free Class-Incremental Learning with Hill-Climbing

http://arxiv.org/abs/2403.07406v1

Compressor summary: FeTrIL++ is an improved framework for class-incremental learning that balances accuracy for new and past classes using oversampling techniques and dynamic optimization strategies.


Accelerated Inference and Reduced Forgetting: The Dual Benefits of Early-Exit Networks in Continual Learning

http://arxiv.org/abs/2403.07404v1

Compressor summary: The study explores how early-exit networks can be adapted for continual learning, improving efficiency and performance in class-incremental settings with a simple method called Task-wise Logits Correction (TLC).


From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios

http://arxiv.org/abs/2403.07403v1

Compressor summary: The paper introduces two new benchmarks for food recognition from daily-life scenarios and a baseline method to improve transferability of existing methods.


Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs

http://arxiv.org/abs/2403.07398v1

Compressor summary: COM2 is a new dataset that helps language models improve their ability to reason about complex events by generating questions from a commonsense knowledge graph.


ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

http://arxiv.org/abs/2403.07392v1

Compressor summary: ViT-CoMer is a plain, pre-training-free, and feature-enhanced ViT backbone that improves dense prediction tasks by injecting spatial pyramid multi-receptive field convolutional features and proposing a simple CNN-Transformer bidirectional fusion interaction module.


Auxiliary CycleGAN-guidance for Task-Aware Domain Translation from Duplex to Monoplex IHC Images

http://arxiv.org/abs/2403.07389v1

Compressor summary: The paper proposes a new method for translating chromogenic immunohistochemistry images to fluorescence images using a novel training design and an auxiliary unpaired image domain, which improves segmentation performance compared to existing methods.


SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

http://arxiv.org/abs/2403.07384v1

Compressor summary: The SmallToLarge (S2L) method improves data efficiency in supervised fine-tuning for specialized domains by selecting data based on training trajectories from small models, achieving better results with less data and a smaller reference model.


Gabor-guided transformer for single image deraining

http://arxiv.org/abs/2403.07380v1

Compressor summary: The Gabformer combines CNNs and Transformers with Gabor filters to improve image deraining by enhancing local texture features and robustness to noise.
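
For reference, here is the standard real-valued 2D Gabor kernel such a design builds on; the sizes, orientations, and frequencies Gabformer actually uses are not given in the summary, so the defaults below are illustrative.

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0, gamma=0.5):
    """Real part of a 2D Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier
```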


Hallmarks of Optimization Trajectories in Neural Networks and LLMs: The Lengths, Bends, and Dead Ends

http://arxiv.org/abs/2403.07379v1

Compressor summary: The paper proposes a new way to understand neural networks by analyzing their optimization trajectories and reveals how different optimization choices affect their behavior.


SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression

http://arxiv.org/abs/2403.07378v1

Compressor summary: SVD-LLM is a new LLM compression method that improves over existing SVD-based methods by using truncation-aware data whitening and layer-wise model parameter update to reduce compression loss and maintain accuracy.
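
The baseline being improved here is plain truncated SVD of each weight matrix; a minimal sketch of that baseline (SVD-LLM's truncation-aware data whitening and layer-wise parameter updates are the paper's additions and are not reproduced):

```python
import numpy as np

def svd_truncate(W, rank):
    """Best rank-`rank` approximation of W in the Frobenius norm."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

W = np.random.randn(256, 256)
W_low_rank = svd_truncate(W, rank=32)  # in practice one stores the two thin factors
```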


NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

http://arxiv.org/abs/2403.07376v1

Compressor summary: This paper proposes NavCoT, a novel strategy to train large language models for vision-and-language navigation tasks by improving their navigational reasoning and interpretability using a chain-of-thought approach.


Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

http://arxiv.org/abs/2403.07372v1

Compressor summary: The ECFusion method addresses extrinsic and inherent cross-modal conflicts in 3D object detection by aligning spatial distributions and preserving objectness clues, leading to improved performance on the nuScenes dataset.


Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models

http://arxiv.org/abs/2403.07371v1

Compressor summary: The study presents a new diffusion-based method for virtual try-on that preserves clothing texture, retains user identity, and is significantly faster than existing approaches.


Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery

http://arxiv.org/abs/2403.07369v1

Compressor summary: TextGCD is a multi-modality framework that uses visual-language models to generate descriptive texts for images and leverages textual-visual disparities for novel visual category discovery, achieving superior performance over existing methods.


Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors

http://arxiv.org/abs/2403.07366v1

Compressor summary: The text introduces a new test-time adaptation method called DeYO that uses a novel confidence metric based on object shape to mitigate error accumulation in online updates.


A New Random Forest Ensemble of Intuitionistic Fuzzy Decision Trees

http://arxiv.org/abs/2403.07363v1

Compressor summary: The paper introduces intuitionistic fuzzy random forest, a new classification method that combines random forest and fuzzy logic, and shows its superior performance compared to existing algorithms.


Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning

http://arxiv.org/abs/2403.07362v1

Compressor summary: The text introduces a new evaluative approach for machine unlearning that identifies the worst-case data subset to erase, using bi-level optimization and experiments on various datasets and models.


FSC: Few-point Shape Completion

http://arxiv.org/abs/2403.07359v1

Compressor summary: The Few-point Shape Completion (FSC) model uses a dual-branch feature extractor and a two-stage revision network to recover 3D shapes from very sparse point clouds, outperforming previous methods and generalizing well to different objects.


Premonition: Using Generative Models to Preempt Future Data Changes in Continual Learning

http://arxiv.org/abs/2403.07356v1

Compressor summary: The paper proposes a method to generate synthetic data based on text descriptions and use it for pre-training models, which improves their performance in continual learning tasks.


BID: Boundary-Interior Decoding for Unsupervised Temporal Action Localization Pre-Training

http://arxiv.org/abs/2403.07354v1

Compressor summary: The text introduces BID, an unsupervised framework that partitions motion sequences into meaningful pre-action segments, improving action localization and understanding performance.


Graph Unlearning with Efficient Partial Retraining

http://arxiv.org/abs/2403.07353v1

Compressor summary: GraphRevoker is a novel framework that improves the unlearning process of GNNs by preserving graph properties and aggregating sub-models effectively.


KEBench: A Benchmark on Knowledge Editing for Large Vision-Language Models

http://arxiv.org/abs/2403.07350v1

Compressor summary: The paper introduces KEBench, a new benchmark for knowledge editing in large vision-language models, with an extended metric (Portability) and improved image data quality to better evaluate model performance.


Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture

http://arxiv.org/abs/2403.07347v1

Compressor summary: FD4MM is a new approach to reveal subtle motions in videos by separating and enhancing low-frequency motion fields and high-frequency details using sparse filters and contrastive regularization, achieving better performance and efficiency than existing methods.


Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction

http://arxiv.org/abs/2403.07346v1

Compressor summary: EvRGBHand is a novel approach for 3D hand mesh reconstruction that combines an event camera and an RGB camera to overcome each other's limitations and improve performance in challenging scenarios.


Rethinking ASTE: A Minimalist Tagging Scheme Alongside Contrastive Learning

http://arxiv.org/abs/2403.07342v1

Compressor summary: ASTE is a subtask of sentiment analysis that extracts structured sentiment triplets from text, and the proposed method uses a novel tagging scheme and contrastive learning to improve performance over existing approaches and LLMs.


IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

http://arxiv.org/abs/2403.07339v1

Compressor summary: The authors propose a method (IM-Unpack) to represent heavy hitters in GEMM matrices with low bit-width integers by unpacking them into multiple smaller matrices, achieving efficiency gains and parity with floating point calculations.
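
The unpacking identity is easy to verify directly: split a matrix with large entries into base-b "digit" matrices with small entries, and recover the original GEMM as a weighted sum of low-bit GEMMs. The sketch below covers the unsigned case only; IM-Unpack's actual scheme (signed values, unpacking only the heavy-hitter entries) is more involved.

```python
import numpy as np

def unpack_digits(A, base=4, n_digits=4):
    """Split a non-negative integer matrix into base-`base` digit matrices,
    each with entries in [0, base)."""
    digits, rest = [], A.copy()
    for _ in range(n_digits):
        digits.append(rest % base)
        rest //= base
    return digits

A = np.random.randint(0, 200, size=(8, 8))   # entries fit in 4 base-4 digits
B = np.random.randint(0, 200, size=(8, 8))
parts = unpack_digits(A)
recomposed = sum((4 ** k) * (Ak @ B) for k, Ak in enumerate(parts))
assert np.array_equal(recomposed, A @ B)     # exact parity with the full GEMM
```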


Large Window-based Mamba UNet for Medical Image Segmentation: Beyond Convolution and Self-attention

http://arxiv.org/abs/2403.07332v1

Compressor summary: LMa-UNet is a novel medical image segmentation method that leverages large windows and a hierarchical Mamba block to achieve efficient long-range dependency modeling with linear complexity.


Unknown Domain Inconsistency Minimization for Domain Generalization

http://arxiv.org/abs/2403.07329v1

Compressor summary: The paper proposes UDIM, a domain generalization method that minimizes loss inconsistency between source and perturbed domains by perturbing instances from the source dataset and combining it with SAM optimization.


SGE: Structured Light System Based on Gray Code with an Event Camera

http://arxiv.org/abs/2403.07326v1

Compressor summary: The paper introduces Gray code in event-based structured light systems, enabling fast and accurate depth estimation with high-speed projection and spatio-temporal encoding.


GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method

http://arxiv.org/abs/2403.07321v1

Compressor summary:
Key points:
- The paper introduces GRiD, a dataset for detecting ChatGPT-generated text
- The dataset contains Reddit context-prompt pairs with both human and ChatGPT responses
- GpTen is a new semi-supervised tensor-based detection method that performs well on the dataset
Summary: The paper presents GRiD, a novel dataset to detect ChatGPT-generated text from Reddit contexts, and proposes GpTen, a semi-supervised tensor-based detection method.


Efficient Diffusion Model for Image Restoration by Residual Shifting

http://arxiv.org/abs/2403.07319v1

Compressor summary: The proposed method improves image restoration by efficiently shifting between high-quality and low-quality images using a Markov chain and a flexible noise schedule, achieving superior or comparable results to current methods with fewer sampling steps.


Knowledge Graph Large Language Model (KG-LLM) for Link Prediction

http://arxiv.org/abs/2403.07311v1

Compressor summary: The paper presents a new method to predict multiple links in knowledge graphs using natural language processing techniques, such as chain-of-thought prompting and in-context learning, which improve the performance and generalization of large language models.


Reinforced Sequential Decision-Making for Sepsis Treatment: The POSNEGDM Framework with Mortality Classifier and Transformer

http://arxiv.org/abs/2403.07309v1

Compressor summary: The paper proposes POSNEGDM, a reinforcement learning framework that uses positive and negative demonstrations and individual patient characteristics to guide sepsis treatment, achieving higher survival rates than existing methods.


Verification-Aided Learning of Neural Network Barrier Functions with Termination Guarantees

http://arxiv.org/abs/2403.07308v1

Compressor summary: The paper proposes a holistic approach to learn barrier functions for system safety with finite-step termination guarantees, by first learning an NN basis function and then fine-tuning it with convexity and counterexamples from verification failure.


Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

http://arxiv.org/abs/2403.07304v1

Compressor summary: Lumen is a large multimodal model that enhances perception capabilities by decoupling learning into task-agnostic and task-specific stages, improving performance on various visual tasks.


Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller

http://arxiv.org/abs/2403.07301v1

Compressor summary: LLaMS generates high-quality, multimodal stories from image streams using commonsense knowledge, textual reasoning, and story illustration.


Taming Pre-trained LLMs for Generalised Time Series Forecasting via Cross-modal Knowledge Distillation

http://arxiv.org/abs/2403.07300v1

Compressor summary: The LLaTA framework aligns language models and time series data to improve multivariate forecasting by leveraging both static and dynamic knowledge from large language models.


Graph Data Condensation via Self-expressive Graph Structure Reconstruction

http://arxiv.org/abs/2403.07294v1

Compressor summary: Graph Data Condensation via Self-expressive Graph Structure Reconstruction (GCSR) is a novel framework that condenses large-scale graphs by incorporating the original graph structure and reconstructing an interpretable synthetic graph.