arxiv compressed, 2024-07-16

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-16, generated by the compressor, my personal LLM-based project.


Make-An-Agent: A Generalizable Policy Network Generator with Behavior-Prompted Diffusion

http://arxiv.org/abs/2407.10973v1

Compressor summary: Key points:
- The paper introduces Make-An-Agent, a policy generator that uses conditional diffusion models to create policies from behavior embeddings
- The model is trained on policy network checkpoints and their trajectories
- It can generate versatile and scalable policies with few-shot demonstrations
- It works on various tasks, domains, and robots, including real-world locomotion

Summary: Make-An-Agent is a novel policy generator that creates versatile and scalable policies from few-shot demonstrations using conditional diffusion models and behavior embeddings.


VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

http://arxiv.org/abs/2407.10972v1

Compressor summary: VGBench is a benchmark for testing Large Language Models' ability to understand and generate vector graphics in various formats and contexts.


Walking the Values in Bayesian Inverse Reinforcement Learning

http://arxiv.org/abs/2407.10971v1

Compressor summary: ValueWalk is a Bayesian IRL method that simplifies reward recovery by focusing on Q-values instead of rewards, making computation faster and enabling efficient sampling using Hamiltonian Monte Carlo.


Q-Sparse: All Large Language Models can be Fully Sparsely-Activated

http://arxiv.org/abs/2407.10969v1

Compressor summary: Q-Sparse is an efficient method for training large language models that achieves comparable results while reducing inference time and costs, and it works well with 1-bit LLMs like BitNet b1.58.
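The core mechanism behind fully sparsely-activated models is top-K activation sparsity: only the largest-magnitude activations in each layer are kept. A minimal sketch with a hypothetical `top_k_sparsify` helper (the actual Q-Sparse method also uses a straight-through estimator for training, omitted here):

```python
import numpy as np

def top_k_sparsify(x, k):
    """Keep the k largest-magnitude activations per row, zeroing the rest."""
    # Indices of all but the k largest-|value| entries in each row.
    drop = np.argsort(np.abs(x), axis=-1)[..., :-k]
    out = x.copy()
    np.put_along_axis(out, drop, 0.0, axis=-1)
    return out

x = np.array([[0.1, -2.0, 0.5, 3.0]])
print(top_k_sparsify(x, 2))  # only -2.0 and 3.0 survive
```

With k set to a fraction of the hidden width, only the surviving entries need to participate in the following matrix multiply, which is where the inference savings come from.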


BECAUSE: Bilinear Causal Representation for Generalizable Offline Model-based Reinforcement Learning

http://arxiv.org/abs/2407.10967v1

Compressor summary: BECAUSE is an algorithm that uses causal representation for states and actions in offline reinforcement learning to reduce objective mismatch and improve performance.


No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations

http://arxiv.org/abs/2407.10964v1

Compressor summary: FUNGI is a simple method to enhance vision encoders' features using self-supervised gradients, leading to consistent performance improvements across various tasks and domains.


Fast Matrix Multiplications for Lookup Table-Quantized LLMs

http://arxiv.org/abs/2407.10960v1

Compressor summary: FLUTE is a fast lookup table engine for large language models that uses weight-only quantization and offline restructuring to reduce memory movement, enabling faster inference with competitive quantization performance.


InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

http://arxiv.org/abs/2407.10958v1

Compressor summary: InVi is a method to insert or replace objects in videos using text-to-image models, addressing challenges of quality control, blending, and temporal coherence by employing a two-step process and extended-attention layers.


Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

http://arxiv.org/abs/2407.10957v1

Compressor summary: The Ref-AVS task introduces a new way to segment objects in visual scenes using natural language expressions with audio and visual cues, and a new method is proposed that effectively utilizes these multimodal cues for precise segmentation.


Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

http://arxiv.org/abs/2407.10956v1

Compressor summary: Spider2-V is a benchmark for testing multimodal agents' ability to automate data science and engineering tasks in real-world enterprise software systems.


MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

http://arxiv.org/abs/2407.10953v1

Compressor summary: The paper introduces a multilingual dataset for mutual reinforcement effect research and a method to translate it using large language models, leading to better open-domain information extraction with a unified model.


Representing Rule-based Chatbots with Transformers

http://arxiv.org/abs/2407.10949v1

Compressor summary: The text explains how researchers constructed a Transformer-based ELIZA chatbot to understand the mechanisms underlying naturalistic conversations.


Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

http://arxiv.org/abs/2407.10947v1

Compressor summary: The paper proposes using text cues from image captions to improve audio guidance for segmenting sounding objects in visual scenes.


Learning from Naturally Occurring Feedback

http://arxiv.org/abs/2407.10944v1

Compressor summary: The authors propose a scalable method for extracting naturalistic user feedback from chat data and show that it improves language model performance and alignment with human preferences.


IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

http://arxiv.org/abs/2407.10937v1

Compressor summary: IDOL is a novel method for generating high-quality human-centric videos and depth maps using dual-modal U-Net and motion consistency losses.


STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

http://arxiv.org/abs/2407.10935v1

Compressor summary: STARS is a self-supervised method for 3D action recognition that improves semantic clustering and generalization using encoder-decoder masked prediction and nearest-neighbor contrastive learning.


Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together

http://arxiv.org/abs/2407.10930v1

Compressor summary: The paper proposes a method to fine-tune NLP systems with multiple stages by optimizing both the language models and prompting strategies together, without gold labels for intermediate stages.


OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting

http://arxiv.org/abs/2407.10923v1

Compressor summary: The paper proposes a novel text-guided out-painting framework using a State-Space Model called Mamba to generate 360-degree images from narrow field of view images, improving visual continuity and context richness with two modules: VCR and GMA.


Benchmarking Vision Language Models for Cultural Understanding

http://arxiv.org/abs/2407.10920v1

Compressor summary: CulturalVQA is a new benchmark for assessing vision-language models' cultural understanding across 11 countries, revealing disparities in their performance and areas of improvement.


PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition

http://arxiv.org/abs/2407.10918v1

Compressor summary: Key points:
- Object recognition systems can be fooled by adversarial perturbations
- Part-based models improve robustness but lack part annotations
- PIN++ provides part segmentation annotations for ImageNet-1K categories
- MPM is a multi-scale part-supervised model that learns from part annotations
- MPM outperforms baselines on adversarial, corruption, and out-of-distribution tests

Summary: The paper introduces PIN++, a dataset with part segmentation annotations for ImageNet-1K, and MPM, a multi-scale part-supervised model that improves object recognition robustness against various challenges.


When Heterophily Meets Heterogeneity: New Graph Benchmarks and Effective Methods

http://arxiv.org/abs/2407.10916v1

Compressor summary: The paper introduces H2GB, a new graph benchmark that combines heterophily and heterogeneity, along with UnifiedGT and H2G-former, a model variant that excels at this challenge.


DataDream: Few-shot Guided Dataset Generation

http://arxiv.org/abs/2407.10910v1

Compressor summary: DataDream is a framework that uses few-shot examples to generate realistic classification datasets, improving image classification accuracy with fewer real data examples.


Interpreting Hand gestures using Object Detection and Digits Classification

http://arxiv.org/abs/2407.10902v1

Compressor summary: This research aims to create a system that can accurately identify hand gestures representing numbers using computer vision, machine learning, and OpenCV techniques to improve human interaction with technology.


Deep Causal Learning to Explain and Quantify The Geo-Tension's Impact on Natural Gas Market

http://arxiv.org/abs/2407.10878v1

Compressor summary: The authors use deep neural networks to identify drivers of natural gas demand and create a counterfactual scenario without the Russian-Ukrainian war to estimate its impact on German energy sectors.


RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

http://arxiv.org/abs/2407.10876v1

Compressor summary: The paper proposes RepVF, a unified representation for concurrent multi-task 3D perception in autonomous driving, and RFTR, a network that exploits the connections between tasks using a hierarchical structure of queries, reducing computational redundancy and feature competition.


GPT Sonograpy: Hand Gesture Decoding from Forearm Ultrasound Images via VLM

http://arxiv.org/abs/2407.10870v1

Compressor summary: GPT-4o is a large AI model that can understand hand gestures from ultrasound without much training, making it useful for various applications.


Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks

http://arxiv.org/abs/2407.10867v1

Compressor summary: The paper proposes a method to certify Graph Neural Networks against data poisoning and backdoor attacks by using the neural tangent kernel and a novel mixed-integer linear program.


R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

http://arxiv.org/abs/2407.10862v1

Compressor summary: The paper proposes a new method, R3D-AD, for detecting anomalies in 3D parts using a diffusion model that obscures the defects and learns to correct them, as well as a simulation strategy to generate diverse defect shapes for better generalization.


Human-Centric Transformer for Domain Adaptive Action Recognition

http://arxiv.org/abs/2407.10860v1

Compressor summary: The Human-Centric Transformer is a method for recognizing actions in videos across different domains by focusing on human cues and their interactions with contexts.


Weighted Grouped Query Attention in Transformers

http://arxiv.org/abs/2407.10855v1

Compressor summary: Weighted Grouped-Query Attention (WGQA) improves attention mechanisms in language models by introducing new learnable parameters for key and value heads, leading to better performance with minimal inference overhead.
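For context, grouped-query attention shares each key/value head across a group of query heads; WGQA, per the summary above, adds learnable weights on top of this sharing. The sketch below (a hypothetical `grouped_query_attention` function, an illustration rather than the paper's code) shows only the KV-head sharing itself:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Single-query GQA sketch.

    q: (n_q_heads, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads attends
    over one shared key/value head.
    """
    n_q_heads, d = q.shape
    group = n_q_heads // n_kv_heads
    outs = []
    for h in range(n_q_heads):
        kv = h // group                          # shared KV head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(d)     # (seq,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                             # softmax over the sequence
        outs.append(w @ v[kv])
    return np.stack(outs)                        # (n_q_heads, d)
```

WGQA's contribution would slot in where the shared `k[kv]`/`v[kv]` are formed, weighting their construction with learned parameters instead of plain sharing.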


An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases

http://arxiv.org/abs/2407.10853v1

Compressor summary: The paper presents a decision framework to help practitioners assess bias and fairness risks in large language models (LLMs) by defining various metrics for different types of risks and use cases.


Rotationally Invariant Latent Distances for Uncertainty Estimation of Relaxed Energy Predictions by Graph Neural Network Potentials

http://arxiv.org/abs/2407.10844v1

Compressor summary: The paper proposes distribution-free techniques for uncertainty prediction in graph neural networks for molecular property prediction, especially relaxed energy calculations, and evaluates latent distance methods as a well-calibrated and economical approach.


Offline Reinforcement Learning with Imputed Rewards

http://arxiv.org/abs/2407.10839v1

Compressor summary: The paper proposes a Reward Model that estimates reward signals from few annotated samples and enables Offline Reinforcement Learning with many reward-free transitions.


Data-Guided Physics-Informed Neural Networks for Solving Inverse Problems in Partial Differential Equations

http://arxiv.org/abs/2407.10836v1

Compressor summary: The study proposes a novel framework called DG-PINNs, which improves the efficiency of solving inverse problems in PDEs by pre-training with only data loss and then fine-tuning with a composite loss function.


Exploration in Knowledge Transfer Utilizing Reinforcement Learning

http://arxiv.org/abs/2407.10835v1

Compressor summary: The paper compares three exploration methods in deep transfer learning for knowledge transfer, finding that upper confidence bound performs best on a virtual drone problem.


MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs

http://arxiv.org/abs/2407.10834v1

Compressor summary: MetaLLM is a framework that intelligently selects the best large language model for classification tasks, improving accuracy and cost-effectiveness.


Temporal Event Stereo via Joint Learning with Stereoscopic Flow

http://arxiv.org/abs/2407.10831v1

Compressor summary: Event cameras can perceive 3D environments even in extreme conditions, and the proposed temporal event stereo framework uses previous time steps to improve stereo matching performance efficiently.


BiasScanner: Automatic Detection and Classification of News Bias to Strengthen Democracy

http://arxiv.org/abs/2407.10829v1

Compressor summary: BiasScanner is a tool that helps users identify and understand media bias in online news articles using a pre-trained language model and a browser plug-in.


LLM Circuit Analyses Are Consistent Across Training and Scale

http://arxiv.org/abs/2407.10827v1

Compressor summary: The study tracks how circuits in decoder-only large language models evolve during training and finds that their algorithms and components remain consistent across model size and pre-training stages.


Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks

http://arxiv.org/abs/2407.10825v1

Compressor summary: Clean-label backdoor attacks are stealthy ways to manipulate deep neural networks by poisoning a small set of target class samples, which can be done with limited information and pose a serious threat when using third-party datasets for training.


Enabling MCTS Explainability for Sequential Planning Through Computation Tree Logic

http://arxiv.org/abs/2407.10820v1

Compressor summary: The paper proposes a novel logic-based explainer for Monte Carlo tree search (MCTS) used in transportation routing services, which translates user requirements into rigorous logic specifications and provides human-readable explanations of the algorithm's operation.


Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

http://arxiv.org/abs/2407.10817v1

Compressor summary: FLAMe, a family of large language models trained on diverse quality assessment tasks, outperforms proprietary LLMs like GPT-4 in various evaluation benchmarks and is less biased than LLM-as-a-Judge models.


Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification

http://arxiv.org/abs/2407.10814v1

Compressor summary: The paper proposes a multi-instance prompt learning framework that combines visual and textual prior knowledge with pre-trained models for few-shot pathology image analysis, improving diagnosis of key patterns.


FabGPT: An Efficient Large Multimodal Model for Complex Wafer Defect Knowledge Queries

http://arxiv.org/abs/2407.10810v1

Compressor summary: FabGPT is a multimodal model for IC fabrication that detects defects, analyzes causes, and answers questions on processes using SEM images and text.


Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain

http://arxiv.org/abs/2407.10807v1

Compressor summary: The paper proposes a novel natural language data stream classification method using convolutional deep networks to detect fake news from text data encoded as discrete digital signals.


Enhancing Robustness to Noise Corruption for Point Cloud Model via Spatial Sorting and Set-Mixing Aggregation Module

http://arxiv.org/abs/2407.10806v1

Compressor summary: Set-Mixer is a novel network architecture for point cloud recognition that improves robustness to noise corruption by enhancing communication among points and preserving relative spatial information.


Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval

http://arxiv.org/abs/2407.10805v1

Compressor summary: Think-on-Graph 2.0 is a framework that uses a knowledge graph to guide language models for better information retrieval and reasoning, improving their accuracy and performance.


Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

http://arxiv.org/abs/2407.10804v1

Compressor summary: Mix-CPT is a new domain adaptation framework for large language models that combines knowledge memorization, utilization, and format alignment with minimal data requirements.


DINO Pre-training for Vision-based End-to-end Autonomous Driving

http://arxiv.org/abs/2407.10803v1

Compressor summary: The article proposes a self-supervised learning method for pre-training visual autonomous driving agents that improves their performance compared to classification-based methods.


Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation

http://arxiv.org/abs/2407.10802v1

Compressor summary: Key points:
- Current optical flow and point-tracking methods rely on synthetic datasets
- Event cameras have advantages in challenging visual conditions
- A novel self-supervised loss combines contrast maximization with a non-linear motion prior
- The method improves performance in dense motion estimation and optical flow estimation

Summary: The paper proposes a new self-supervised loss that leverages event camera data and enhances performance in motion estimation and optical flow tasks.


Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping

http://arxiv.org/abs/2407.10795v1

Compressor summary: The paper introduces a new contrastive decoding method for improving large language models' reasoning performance on diverse languages by skipping some layers to avoid language mismatch.


Graphusion: Leveraging Large Language Models for Scientific Knowledge Graph Fusion and Construction in NLP Education

http://arxiv.org/abs/2407.10794v1

Compressor summary: Graphusion is a zero-shot knowledge graph construction framework that fuses global information from text, improving performance on link prediction and NLP tasks like QA.


GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework

http://arxiv.org/abs/2407.10793v1

Compressor summary: GraphEval is a framework to evaluate large language model responses using knowledge graph structures, which helps detect inconsistencies (hallucinations) and potentially correct them.


AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler

http://arxiv.org/abs/2407.10784v1

Compressor summary: AdapTable is a novel tabular test-time adaptation method that modifies output probabilities to handle distribution shifts, skewed entropy, latent space decision boundaries, confidence calibration issues, and model bias in tabular data.


Correlations Are Ruining Your Gradient Descent

http://arxiv.org/abs/2407.10780v1

Compressor summary: The text discusses how natural gradient descent, data decorrelation, and approximate methods for backpropagation can improve neural network training by considering local curvature and reducing correlations in data.
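As a concrete illustration of data decorrelation, here is a minimal ZCA-whitening sketch; this is a standard technique shown for intuition, and the paper's actual decorrelation procedure may differ:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Decorrelate features so the empirical covariance is ~identity."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    vals, vecs = np.linalg.eigh(cov)
    # ZCA transform: C^{-1/2}, regularized by eps for tiny eigenvalues.
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W
```

After whitening, gradient directions are no longer skewed by correlated input dimensions, which is the connection to the curvature argument in the summary.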


The Missing Link: Allocation Performance in Causal Machine Learning

http://arxiv.org/abs/2407.10779v1

Compressor summary: Automated decision-making systems using causal ML models face challenges in complex social environments, which affect their performance in various tasks, as illustrated by a real-world jobseekers dataset.


Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning

http://arxiv.org/abs/2407.10775v1

Compressor summary: The paper proposes a framework for constrained reinforcement learning using gradient-based primal-dual algorithms, introduces an exploration-agnostic algorithm called C-PG, and shows its effectiveness on constrained control problems.


MSegRNN: Enhanced SegRNN Model with Mamba for Long-Term Time Series Forecasting

http://arxiv.org/abs/2407.10768v1

Compressor summary: MSegRNN is a variant of SegRNN that uses Mamba structure and other enhancements to improve memory efficiency and performance in long-term time series forecasting.


Domain Generalization for 6D Pose Estimation Through NeRF-based Image Synthesis

http://arxiv.org/abs/2407.10762v1

Compressor summary: The authors propose a new augmentation method using Neural Radiance Fields to generate diverse synthetic images for improving 6D pose estimation on spacecraft poses.


Physics-Informed Machine Learning for Smart Additive Manufacturing

http://arxiv.org/abs/2407.10761v1

Compressor summary: The paper proposes a physics-informed machine learning model that combines neural networks and physical laws for better smart manufacturing outcomes in laser metal deposition.


Continual Deep Learning on the Edge via Stochastic Local Competition among Subnetworks

http://arxiv.org/abs/2407.10758v1

Compressor summary: The paper proposes a novel method that uses stochastic competition to create sparse task-specific representations in deep networks, enabling efficient continual learning on edge devices with limited resources.


GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation

http://arxiv.org/abs/2407.10756v1

Compressor summary: GTPT is a new method for efficient human pose estimation using Transformer that introduces keypoints coarsely and prunes redundant tokens, achieving high performance on COCO and COCO-WholeBody datasets.


An Autonomous Drone Swarm for Detecting and Tracking Anomalies among Dense Vegetation

http://arxiv.org/abs/2407.10754v1

Compressor summary: Swarms of drones using anomaly detection and adaptive sampling can effectively detect and track occluded targets in dense vegetation with high accuracy and low latency.


OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

http://arxiv.org/abs/2407.10753v1

Compressor summary: OPEN is a novel multi-view 3D object detector that uses object-wise depth information to improve detection accuracy, achieving state-of-the-art performance on nuScenes.


SEED: A Simple and Effective 3D DETR in Point Clouds

http://arxiv.org/abs/2407.10749v1

Compressor summary: SEED is a 3D DETR method that addresses challenges in point cloud detection by using dual query selection and deformable grid attention modules, achieving state-of-the-art performance on Waymo and nuScenes datasets.


Codebook LLMs: Adapting Political Science Codebooks for LLM Use and Adapting LLMs to Follow Codebooks

http://arxiv.org/abs/2407.10747v1

Compressor summary: The authors examine how large language models can be used for coding unstructured political texts using codebooks and propose instruction-tuning as a way to improve performance.


What distinguishes conspiracy from critical narratives? A computational analysis of oppositional discourse

http://arxiv.org/abs/2407.10745v1

Compressor summary: Key points:
- Conspiracy theories and critical texts need different annotation schemes
- Inter-group conflict is important in oppositional narratives
- The XAI-DisInfodemics corpus contains annotated COVID-19 Telegram messages
- NLP-based automation can distinguish conspiracy vs. critical texts

Summary: The paper proposes a new annotation scheme for conspiracy theories and critical texts, uses a multilingual corpus of COVID-19 messages, and shows that NLP can detect inter-group conflict and violence in oppositional narratives.


AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

http://arxiv.org/abs/2407.10738v1

Compressor summary: Key points:
- The paper proposes AccDiffusion, a method for patch-wise higher-resolution image generation without training
- It decouples the vanilla prompt into patch-content-aware prompts to avoid repeated object generation
- It introduces dilated sampling with window interaction for better global consistency

Summary: AccDiffusion is a new method that generates high-resolution images from low-resolution ones by using patch-content-aware prompts and dilated sampling, avoiding repeated objects and improving global consistency.


Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models

http://arxiv.org/abs/2407.10737v1

Compressor summary: Key points:
- The text presents Vi-ST, a model that uses a Vision Transformer and a spatiotemporal convolutional neural network to study the temporal features of visual coding in natural scenes
- The model performs well in generalization tests and reveals the importance of each temporal module
- The text introduces a new metric for evaluating visual coding based on neuronal activity

Summary: Vi-ST is a novel model that uses deep learning to unravel how the brain encodes dynamic visual scenes with neurons, and proposes a new metric for measuring this encoding.


Transforming Agency. On the mode of existence of Large Language Models

http://arxiv.org/abs/2407.10735v1

Compressor summary: The paper examines the nature of Large Language Models (LLMs) like ChatGPT and argues that they are not autonomous agents but interlocutors or linguistic automata that can create realistic conversation experiences with humans.


When Synthetic Traces Hide Real Content: Analysis of Stable Diffusion Image Laundering

http://arxiv.org/abs/2407.10736v1

Compressor summary: The paper discusses the forensic implications of image laundering using Stable Diffusion models and proposes a two-stage detection pipeline to differentiate between real, laundered, and synthetic images.


On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

http://arxiv.org/abs/2407.10734v1

Compressor summary: The paper proposes a method to train deep neural networks (DNNs) efficiently on low-resource microcontrollers using quantized training and dynamic partial gradient updates.


Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

http://arxiv.org/abs/2407.10733v1

Compressor summary: Mask-JEPA is a self-supervised learning framework for segmentation models that combines mask classification architectures with a joint embedding predictive architecture, addressing challenges in extracting representations and training the decoder, and achieving competitive results and adaptability across various datasets.


ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation

http://arxiv.org/abs/2407.10730v1

Compressor summary: ConvBench is a benchmark for evaluating and comparing convolution algorithms in deep learning models by assessing 9243 operations from 1097 real-world models and providing detailed performance and execution breakdown graphs.


CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses

http://arxiv.org/abs/2407.10725v1

Compressor summary: The paper introduces CLAVE, a framework for assessing Large Language Models' values using two complementary LLMs, and ValEval, a dataset with diverse value systems to benchmark evaluators.


Anticipating Future Object Compositions without Forgetting

http://arxiv.org/abs/2407.10723v1

Compressor summary: The paper proposes new methods to improve object detection in computer vision models by enhancing their ability to learn and generalize from novel compositions of objects and attributes.


Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning

http://arxiv.org/abs/2407.10718v1

Compressor summary: Sibyl is a large language model-based agent framework that uses a global workspace, a multi-agent debate-based jury, and tools to enhance complex reasoning skills.


Detecting Omissions in Geographic Maps through Computer Vision

http://arxiv.org/abs/2407.10709v1

Compressor summary: The paper presents a computer vision method to automatically identify and analyze maps, especially those with designated names of regions and landmarks, using a Convolutional Neural Network and the VinMap dataset.


Interactive Rendering of Relightable and Animatable Gaussian Avatars

http://arxiv.org/abs/2407.10707v1

Compressor summary: Our method uses Gaussian Splatting to efficiently render animatable avatars from sparse-view or monocular videos under novel viewpoints, poses, and lightings.


Quantized Prompt for Efficient Generalization of Vision-Language Models

http://arxiv.org/abs/2407.10704v1

Compressor summary: The paper proposes quantization as a regularization technique for vision-language models, reducing overfitting and catastrophic forgetting while maintaining efficiency and generalization, and provides code at GitHub.


Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation

http://arxiv.org/abs/2407.10703v1

Compressor summary: The paper proposes a method to translate annotated day events into night events using Diffusion GAN and improves event-based models' performance on night scenes.


Geometric Analysis of Unconstrained Feature Models with $d=K$

http://arxiv.org/abs/2407.10702v1

Compressor summary: Neural Collapse occurs when training deep neural networks for classification with the feature dimension equal to the number of classes, and is related to saddle points in popular unconstrained feature models.


DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

http://arxiv.org/abs/2407.10701v1

Compressor summary: DocBench is a new benchmark for evaluating large language model-based document reading systems on real documents and questions across various domains.


Deep ContourFlow: Advancing Active Contours with Deep Learning

http://arxiv.org/abs/2407.10696v1

Compressor summary: The paper proposes a new method that combines unsupervised active contour models with deep learning for robust image segmentation, especially useful in histology where labeling data is scarce.


IE-NeRF: Inpainting Enhanced Neural Radiance Fields in the Wild

http://arxiv.org/abs/2407.10695v1

Compressor summary: Our method enhances NeRF with image inpainting to handle transient objects and improve volume rendering quality for realistic novel view synthesis using uncontrolled photos in the wild.


Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

http://arxiv.org/abs/2407.10694v1

Compressor summary: The paper proposes FRD-ReID, a person re-identification network that controllably separates clothing-unrelated and clothing-related features using human parsing masks and attention mechanisms.


Probability Passing for Graph Neural Networks: Graph Structure and Representations Joint Learning

http://arxiv.org/abs/2407.10688v1

Compressor summary: The paper proposes PPGNN, a method that infers a latent graph structure from node features, refines it with Probability Passing, and applies it to GNNs for non-Euclidean data analysis while accounting for noise and improving efficiency.


FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation

http://arxiv.org/abs/2407.10687v1

Compressor summary: FRI-Net is a novel method that reconstructs 2D floorplans from 3D point clouds using an implicit representation with structural regularization, outperforming existing methods on two challenging datasets.


Addressing Image Hallucination in Text-to-Image Generation through Factual Image Retrieval

http://arxiv.org/abs/2407.10683v1

Compressor summary: Key points:
- Text-to-image generation with diffusion models can produce factually inconsistent images (image hallucination)
- Image hallucination is classified into three types based on language model studies
- Factual information from external images is used to generate realistic images using image editing tools

Summary: The paper proposes a method to improve text-to-image generation by detecting and correcting three types of image hallucination using factual information from external images and image editing tools.


GeoMix: Towards Geometry-Aware Data Augmentation

http://arxiv.org/abs/2407.10681v1

Compressor summary: Geometric Mixup (GeoMix) is a simple and interpretable technique that uses in-place graph editing to synthesize nodes and establish connections for them, improving node classification with limited labeled data using graph neural networks (GNNs).
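The in-place editing idea can be sketched in a few lines. This is not the paper's exact GeoMix procedure, just a generic illustration of mixup by graph editing: a synthetic node interpolates the features and soft labels of two existing nodes and is wired into both of their neighborhoods (the function name and mixing rule are my assumptions).

```python
import numpy as np

def mixup_node(features, labels, adj, i, j, lam=0.5):
    """Synthesize one node by interpolating nodes i and j in place.

    features: (n, d) array, labels: (n, k) one-hot array,
    adj: dict mapping node id -> set of neighbor ids.
    Returns the updated arrays/adjacency and the new node's id.
    """
    new_id = features.shape[0]
    x_new = lam * features[i] + (1.0 - lam) * features[j]   # feature interpolation
    y_new = lam * labels[i] + (1.0 - lam) * labels[j]       # soft label
    features = np.vstack([features, x_new])
    labels = np.vstack([labels, y_new])
    # connect the synthetic node to the local neighborhoods of both endpoints
    adj[new_id] = (adj[i] | adj[j] | {i, j}) - {new_id}
    for nb in adj[new_id]:
        adj[nb].add(new_id)
    return features, labels, adj, new_id
```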


Qwen2 Technical Report

http://arxiv.org/abs/2407.10671v1

Compressor summary: The Qwen2 series is a new line of large language models and multimodal models that outperform previous models and have strong performance in various tasks across multiple languages.


Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

http://arxiv.org/abs/2407.10670v1

Compressor summary: The paper proposes Query Rewriter+, Knowledge Filter, Memory Knowledge Reservoir, and Retriever Trigger to improve the RAG system's response quality and efficiency by addressing issues such as Information Plateaus, Ambiguity, Irrelevant Knowledge, and Redundant Retrieval.


Spatio-temporal neural distance fields for conditional generative modeling of the heart

http://arxiv.org/abs/2407.10663v1

Compressor summary: The paper introduces a new model that uses neural distance fields to capture the shape and movement of the heart chambers and their relation to clinical demographics, improving on existing methods for anatomical sequence completion.


XEQ Scale for Evaluating XAI Experience Quality Grounded in Psychometric Theory

http://arxiv.org/abs/2407.10662v1

Compressor summary: The XAI Experience Quality (XEQ) Scale is a new evaluation tool that measures the quality of XAI experiences based on four dimensions: learning, utility, fulfilment and engagement.


An Empirical Study of Validating Synthetic Data for Formula Generation

http://arxiv.org/abs/2407.10657v1

Compressor summary: The paper proposes a method to generate synthetic formulas from natural language (NL) for fine-tuning LLMs and shows that validating these formulas improves performance and problem-solving ability.


OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

http://arxiv.org/abs/2407.10655v1

Compressor summary: The paper introduces OVLW-DETR, a fast and flexible open-vocabulary detector that uses vision-language model embeddings to detect novel categories guided by text.


Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews

http://arxiv.org/abs/2407.10652v1

Compressor summary: The study shows that using large language models can significantly speed up and improve the accuracy of literature review filtering, reducing manual work while achieving high recall.


APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

http://arxiv.org/abs/2407.10649v1

Compressor summary: The paper introduces APC, a ViT-based method for weakly supervised semantic segmentation that improves patch embeddings using Adaptive-K Pooling and Patch Contrastive Learning, and enhances efficiency by adopting an end-to-end single-stage training approach.


Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models

http://arxiv.org/abs/2407.10645v1

Compressor summary: This study examines how prompt selection affects text annotation accuracy using large language models and proposes a method to optimize prompts.


Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems

http://arxiv.org/abs/2407.10641v1

Compressor summary: DDIP and D3IP are methods to improve 3D image reconstruction from generative diffusion priors, using meta-learning and efficient adaptation for diverse tasks and data.


Evaluating Model Bias Requires Characterizing its Mistakes

http://arxiv.org/abs/2407.10633v1

Compressor summary: SkewSize is a metric that measures and characterizes model biases in various settings by analyzing the interaction between spurious variables and predictions.


Balancing the Scales: Reinforcement Learning for Fair Classification

http://arxiv.org/abs/2407.10629v1

Compressor summary: The paper explores using reinforcement learning to address bias in imbalanced classification by scaling the reward function based on contextual multi-armed bandits.
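The core idea of scaling rewards to counter class imbalance can be illustrated with a toy sketch. This is not the paper's formulation, just a minimal example in which a correct prediction (the chosen "arm") earns a reward inversely proportional to its class frequency, so minority classes are weighted up (function names and the scaling rule are my assumptions).

```python
from collections import Counter

def inverse_frequency_rewards(y_train):
    """Reward scale for each class c, proportional to 1 / p(c),
    so correct predictions of rare (minority) classes pay more."""
    counts = Counter(y_train)
    n = len(y_train)
    return {c: n / cnt for c, cnt in counts.items()}

def bandit_reward(y_true, action, reward_scale):
    # the bandit's arm is the predicted class; wrong predictions earn nothing
    return reward_scale[y_true] if action == y_true else 0.0
```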


Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena

http://arxiv.org/abs/2407.10627v1

Compressor summary: Arena Learning is a method that uses AI-driven annotations to evaluate and improve large language models efficiently.


NoviCode: Generating Programs from Natural Language Utterances by Novices

http://arxiv.org/abs/2407.10626v1

Compressor summary: NoviCode is a new task that challenges Text-to-Code models to generate executable programs, with API access and control structures, from natural language descriptions written by novice non-programmers; aligning the NL utterances with the code structure improves performance.


WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

http://arxiv.org/abs/2407.10625v1

Compressor summary: WildVidFit is an image-based model that generates realistic video try-on sequences by conditioning on garment descriptions and human motion, using diffusion guidance from pre-trained models to maintain temporal coherence.


An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots

http://arxiv.org/abs/2407.10596v1

Compressor summary: This paper evaluates various CNN models and data augmentation techniques for hierarchical mobile robot localization using omnidirectional images, and provides public code on the project website.


InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture

http://arxiv.org/abs/2407.10592v1

Compressor summary: InsertDiffusion is a training-free diffusion architecture that efficiently embeds realistic object visualizations into images without extensive training or fine-tuning.


COSMU: Complete 3D human shape from monocular unconstrained images

http://arxiv.org/abs/2407.10586v1

Compressor summary: The text presents a new method to reconstruct detailed 3D human shapes from monocular images by generating multi-view normal maps and using an attention-based neural implicit model.


Three Dogmas of Reinforcement Learning

http://arxiv.org/abs/2407.10583v1

Compressor summary: The text discusses three dogmas in modern reinforcement learning that have shaped the field but may need reevaluation for realizing its full potential.


Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

http://arxiv.org/abs/2407.10582v1

Compressor summary: The paper proposes a method to generate task-specific data for low-resource languages using large language models and teacher models, improving cross-lingual performance on various tasks.


Leveraging Hybrid Intelligence Towards Sustainable and Energy-Efficient Machine Learning

http://arxiv.org/abs/2407.10580v1

Compressor summary: The paper proposes a hybrid intelligence approach that uses human input and artificial intelligence to create more energy-efficient machine learning models.


A Survey of Defenses against AI-generated Visual Media: Detection, Disruption, and Authentication

http://arxiv.org/abs/2407.10575v1

Compressor summary: This paper reviews defenses against AI-generated visual media in computer vision applications, covering detection, disruption, and authentication methods, as well as trustworthiness aspects like robustness and fairness.


Stacking-Enhanced Bagging Ensemble Learning for Breast Cancer Classification with CNN

http://arxiv.org/abs/2407.10574v1

Compressor summary: The paper presents a fast and accurate CNN model for breast cancer classification using bagging and stacking ensemble methods, achieving high accuracy and recall and outperforming VGG16 and ResNet-50 in comparative experiments.


PULPo: Probabilistic Unsupervised Laplacian Pyramid Registration

http://arxiv.org/abs/2407.10567v1

Compressor summary: PULPo is a probabilistic deformable image registration method that accurately estimates uncertainty using Laplacian pyramids on hierarchical levels.


Pathformer3D: A 3D Scanpath Transformer for 360° Images

http://arxiv.org/abs/2407.10563v1

Compressor summary: The paper introduces Pathformer3D, a novel 3D scanpath Transformer for predicting eye movements in 360° images that uses self-attention to model visual working memory and outperforms existing methods.


LIP-CAR: contrast agent reduction by a deep learned inverse problem

http://arxiv.org/abs/2407.10559v1

Compressor summary: The paper tackles the contrast agent reduction (CAR) problem of lowering dosage in medical imaging while preserving visual enhancement, proposing a learned inverse problem (LIP) approach and demonstrating its effectiveness on pre-clinical images.


ConTEXTure: Consistent Multiview Images to Texture

http://arxiv.org/abs/2407.10558v1

Compressor summary: ConTEXTure is a network that creates texture maps for 3D meshes using text prompts and multiple images from different viewpoints, improving accuracy by generating consistent images with Zero123++.


Beyond Generative Artificial Intelligence: Roadmap for Natural Language Generation

http://arxiv.org/abs/2407.10554v1

Compressor summary: The paper reviews recent surveys in Natural Language Generation (NLG) to identify gaps and challenges posed by Large Language Models (LLMs) and propose a research roadmap for the field.


Learning Natural Consistency Representation for Face Forgery Video Detection

http://arxiv.org/abs/2407.10550v1

Compressor summary: The paper proposes a self-supervised method for detecting face forgery videos using natural spatiotemporal consistency of real face videos, improving generalization and robustness over existing methods.


Efficient Continual Learning with Low Memory Footprint For Edge Device

http://arxiv.org/abs/2407.10545v1

Compressor summary: LightCL is a compact algorithm that improves continual learning efficiency by reducing resource consumption and enhancing generalizability using new metrics on neural network layers.


Understanding the Dependence of Perception Model Competency on Regions in an Image

http://arxiv.org/abs/2407.10543v1

Compressor summary: The authors propose five methods to help DNN-based perception models identify unfamiliar regions in images, which can improve their decision-making when facing low competency situations.


3D Geometric Shape Assembly via Efficient Point Cloud Matching

http://arxiv.org/abs/2407.10542v1

Compressor summary: The paper introduces PMT and PMTR, a new framework for reliable and efficient geometric shape assembly using local correspondences and high-order feature transforms.


An experimental evaluation of Siamese Neural Networks for robot localization using omnidirectional imaging in indoor environments

http://arxiv.org/abs/2407.10536v1

Compressor summary: The paper proposes locating robots indoors with Siamese Neural Networks that compute a descriptor-based similarity function between panoramic images from a catadioptric vision system, outperforming previous techniques, especially in cloudy and night conditions.


Automated Label Unification for Multi-Dataset Semantic Segmentation with GNNs

http://arxiv.org/abs/2407.10534v1

Compressor summary: The paper presents a graph neural network method that unifies the conflicting label spaces of multiple datasets, enabling semantic segmentation models to be trained across them without extra manual reannotation and achieving state-of-the-art results over other multi-dataset training methods.


Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

http://arxiv.org/abs/2407.10528v1

Compressor summary: The paper proposes a method to generate realistic human motions by using local actions as control signals and blending them with graph attention networks and motion diffusion.


TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription Prediction

http://arxiv.org/abs/2407.10510v1

Compressor summary: The authors introduce DigestDS, a dataset for predicting TCM prescriptions for digestive system diseases, and propose TCM-FTP, a method that fine-tunes pre-trained language models to achieve significant improvements in prediction accuracy.


CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

http://arxiv.org/abs/2407.10499v1

Compressor summary: The paper introduces CIBench, a framework to evaluate how well large language models use code interpreters for data science tasks with or without human help.


Improving Hyperbolic Representations via Gromov-Wasserstein Regularization

http://arxiv.org/abs/2407.10495v1

Compressor summary: The authors propose using the Gromov-Wasserstein distance as a regularization mechanism in hyperbolic neural networks to better preserve the original data structure.


Learning to Unlearn for Robust Machine Unlearning

http://arxiv.org/abs/2407.10494v1

Compressor summary: The paper proposes a novel framework called Learning-to-Unlearn that uses meta-learning to balance between erasing specific data samples and maintaining the overall performance of the model.


Learning Dynamics of LLM Finetuning

http://arxiv.org/abs/2407.10490v1

Compressor summary: The text studies how large language models learn new tasks by analyzing their step-by-step changes and influences on different responses, helping to understand and improve their behavior.


How and where does CLIP process negation?

http://arxiv.org/abs/2407.10488v1

Compressor summary: The text explores how to understand the internal workings of vision & language models, focusing on their handling of negation, using CLIP as an example.


Lite2Relight: 3D-aware Single Image Portrait Relighting

http://arxiv.org/abs/2407.10487v1

Compressor summary: Lite2Relight is a novel technique that enables realistic 3D view synthesis and light editing of human portraits with efficient volumetric representation and robust face geometry, outperforming state-of-the-art methods.


IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

http://arxiv.org/abs/2407.10486v1

Compressor summary: The paper introduces two modules for improving query-focused summarization using large language models, addressing document length and fine-grained alignment.


Effective Motion Modeling for UAV-platform Multiple Object Tracking with Re-Margin Loss

http://arxiv.org/abs/2407.10485v1

Compressor summary: The text proposes a new method for tracking objects from UAVs that improves accuracy and efficiency by modeling motion with detection features and addressing the long-tailed distribution of object motion.


Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry

http://arxiv.org/abs/2407.10484v1

Compressor summary: The paper explains why Euclidean classifiers work well with Riemannian features in GCP by analyzing matrix functions from a Riemannian geometry perspective and showing their effectiveness in visual classification tasks.


G-PCGRL: Procedural Graph Data Generation via Reinforcement Learning

http://arxiv.org/abs/2407.10483v1

Compressor summary: The paper presents G-PCGRL, a reinforcement learning method that adapts the PCGRL framework with new representations to procedurally generate graph data for games under constraints, proving faster, more reliable, and more controllable than random search and an evolutionary algorithm.


NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis

http://arxiv.org/abs/2407.10482v1

Compressor summary: NGP-RT improves the rendering speed of Instant-NGP by using hash features, lightweight attention, and a pre-computed occupancy distance grid for fast and high-quality novel view synthesis.


SuperPADL: Scaling Language-Directed Physics-Based Control with Progressive Supervised Distillation

http://arxiv.org/abs/2407.10481v1

Compressor summary: SuperPADL is a framework that uses both reinforcement and supervised learning to train controllers for real-time physics-based text-to-motion on thousands of diverse motion clips, allowing users to create interactive animations.


Kinetic Typography Diffusion Model

http://arxiv.org/abs/2407.10476v1

Compressor summary: The paper proposes a method for generating realistic and aesthetically pleasing kinetic typography videos from text prompts using guided video diffusion models, static and dynamic captions, and glyph loss.


DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models

http://arxiv.org/abs/2407.10459v1

Compressor summary: DiffStega is a training-free diffusion-based coverless image steganography method that uses a password-dependent reference image and the text as keys, enhancing security and versatility.


The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

http://arxiv.org/abs/2407.10457v1

Compressor summary: The study explores performance differences between greedy decoding and sampling methods in large language models, highlighting non-determinism's impact on evaluations and showing the potential of smaller models.


Don't Throw Away Data: Better Sequence Knowledge Distillation

http://arxiv.org/abs/2407.10456v1

Compressor summary: The paper proposes using multiple high scoring translations from minimum Bayes risk decoding in knowledge distillation for neural machine translation, achieving better results than existing methods.
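The selection step of minimum Bayes risk (MBR) decoding can be sketched briefly: each candidate is scored by its expected utility against the other candidates, and the top scorers become distillation targets. The unigram-F1 utility and the function names below are stand-ins of mine, not the paper's actual utility function.

```python
from collections import Counter

def unigram_f1(a, b):
    """Simple stand-in utility: unigram overlap F1 between two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    overlap = sum((ca & cb).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    return 2.0 * overlap / (sum(ca.values()) + sum(cb.values()))

def mbr_top_k(candidates, k=2, utility=unigram_f1):
    """Score each candidate by its average utility against the others
    and keep the k highest scorers (the distillation targets)."""
    scores = []
    for i, c in enumerate(candidates):
        others = [utility(c, o) for j, o in enumerate(candidates) if j != i]
        scores.append(sum(others) / max(len(others), 1))
    ranked = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in ranked[:k]]
```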


Deflated Dynamics Value Iteration

http://arxiv.org/abs/2407.10454v1

Compressor summary: The text introduces Deflated Dynamics Value Iteration, a faster method than standard Value Iteration for computing the value function of Markov decision processes and reinforcement learning problems by removing dominant eigen-structures from the transition matrix.
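DDVI's exact algorithm is in the paper; as an illustration of why deflating a dominant eigenpair accelerates convergence, here is a sketch for policy evaluation. A stochastic matrix P has dominant eigenpair (1, stationary distribution mu); removing it leaves a matrix whose spectral radius is the second eigenvalue, so the fixed-point iteration contracts faster, and the removed direction's contribution is restored in closed form. The derivation and names are mine, not the paper's.

```python
import numpy as np

def stationary_dist(P, iters=500):
    # power iteration on the left eigenvector of P (eigenvalue 1)
    mu = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        mu = mu @ P
        mu /= mu.sum()
    return mu

def deflated_policy_eval(P, r, gamma, iters=200):
    """Solve V = r + gamma * P V by iterating with the deflated dynamics
    E = P - 1 mu^T. Since mu^T E = 0, the removed direction contributes
    the constant c = mu^T r / (1 - gamma), added back analytically."""
    n = P.shape[0]
    mu = stationary_dist(P)
    E = P - np.outer(np.ones(n), mu)   # dominant eigenpair removed
    c = mu @ r / (1.0 - gamma)         # closed-form correction
    V = np.zeros(n)
    for _ in range(iters):
        V = r + gamma * c * np.ones(n) + gamma * (E @ V)
    return V
```

The iteration's contraction factor is gamma times the second-largest eigenvalue modulus of P, rather than gamma itself.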


Enhancing Medication Recommendation with LLM Text Representation

http://arxiv.org/abs/2407.10453v1

Compressor summary: The authors propose using a large language model to enhance medication recommendation by combining text and medical codes, improving performance on two datasets.


GraphPrint: Extracting Features from 3D Protein Structure for Drug Target Affinity Prediction

http://arxiv.org/abs/2407.10452v1

Compressor summary: GraphPrint is a framework for predicting drug target affinity using graph representations of protein 3D structures, which improve over traditional features alone.


A Fast, Robust Elliptical Slice Sampling Implementation for Linearly Truncated Multivariate Normal Distributions

http://arxiv.org/abs/2407.10449v1

Compressor summary: The paper introduces a fast and stable algorithm for constructing elliptical slice sampling intersections, which improves Monte Carlo methods.
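For context, a generic elliptical slice sampler for a zero-mean Gaussian restricted to a linear polytope looks like the sketch below. This does not reproduce the paper's optimized intersection construction; it uses the standard bracket-shrinkage loop with an indicator "likelihood" (accept an angle only if every constraint holds), and the names are my assumptions.

```python
import numpy as np

def ess_truncated_mvn(x0, chol_cov, A, b, n_samples, rng=None):
    """Elliptical slice sampling for N(0, Sigma) truncated to {x : A x <= b}.

    chol_cov is a Cholesky factor of Sigma. A proposal on the ellipse
    through the current point is accepted iff it is feasible; otherwise
    the angle bracket is shrunk toward the current point (angle 0).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    assert np.all(A @ x <= b), "x0 must satisfy the constraints"
    out = []
    for _ in range(n_samples):
        nu = chol_cov @ rng.standard_normal(x.shape)  # auxiliary N(0, Sigma) draw
        theta = rng.uniform(0.0, 2.0 * np.pi)
        lo, hi = theta - 2.0 * np.pi, theta
        while True:
            prop = x * np.cos(theta) + nu * np.sin(theta)
            if np.all(A @ prop <= b):                 # feasible -> accept
                x = prop
                break
            if theta < 0.0:                           # infeasible -> shrink bracket
                lo = theta
            else:
                hi = theta
            theta = rng.uniform(lo, hi)
        out.append(x.copy())
    return np.array(out)
```

The loop always terminates because the bracket shrinks toward angle 0, where the proposal equals the current (feasible) point.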


Spectral Representation for Causal Estimation with Hidden Confounders

http://arxiv.org/abs/2407.10448v1

Compressor summary: The paper proposes a new method for causal effect estimation with hidden confounders using saddle-point optimization and neural networks.


Backdoor Attacks against Image-to-Image Networks

http://arxiv.org/abs/2407.10445v1

Compressor summary: The paper investigates the vulnerability of image-to-image networks to backdoor attacks and proposes a novel attack technique that can compromise these networks without affecting their normal behavior on clean images.


Enhancing Building Safety Design for Active Shooter Incidents: Exploration of Building Exit Parameters using Reinforcement Learning-Based Simulations

http://arxiv.org/abs/2407.10441v1

Compressor summary: This study uses a reinforcement learning-based simulation to investigate how exit availability and configuration affect evacuation and harm rates in active shooter incidents in office environments, finding that more exits near the shooter's starting point improve safety.


PolyRoom: Room-aware Transformer for Floorplan Reconstruction

http://arxiv.org/abs/2407.10439v1

Compressor summary: This paper proposes PolyRoom, a Transformer model that uses uniform sampling representation, room-aware query initialization, and room-aware self-attention to reconstruct floorplans from point clouds while overcoming common challenges.


A Multi-Stage Framework for 3D Individual Tooth Segmentation in Dental CBCT

http://arxiv.org/abs/2407.10433v1

Compressor summary: The paper presents a multi-stage framework for 3D individual tooth segmentation in dental CBCT that needs less annotated data and is more robust to domain shift than deep learning methods, placing third in the STS-3D challenge and outperforming other semi-supervised methods.


Expanding the Scope: Inductive Knowledge Graph Reasoning with Multi-Starting Progressive Propagation

http://arxiv.org/abs/2407.10430v1

Compressor summary: MStar is a new inductive KG reasoning model that uses C-MPNNs and multiple query-specific starting entities to improve message propagation efficiency and mitigate noise in training samples.


Omni-Dimensional Frequency Learner for General Time Series Analysis

http://arxiv.org/abs/2407.10419v1

Compressor summary: The Omni-Dimensional Frequency Learner (ODFL) model improves frequency-based methods for time series analysis by addressing channel redundancy, sparse frequency energy distribution, and semantic diversity in the spectrum feature.


Melon Fruit Detection and Quality Assessment Using Generative AI-Based Image Data Augmentation

http://arxiv.org/abs/2407.10413v1

Compressor summary: The study used generative AI to create realistic images of fruits, which improved fruit detection and quality assessment using deep learning models like YOLO.


Towards Scale-Aware Full Surround Monodepth with Transformers

http://arxiv.org/abs/2407.10406v1

Compressor summary: The paper proposes a transformer-based depth network and feature matching scheme to improve scale-awareness for full surround monodepth methods, achieving better performance than state-of-the-art methods.


Cooperative Reward Shaping for Multi-Agent Pathfinding

http://arxiv.org/abs/2407.10403v1

Compressor summary: The paper introduces a reward shaping technique for Multi-Agent Reinforcement Learning (MARL) to improve cooperation and efficiency in planning paths for multiple agents.


Exploring the Impact of Moire Pattern on Deepfake Detectors

http://arxiv.org/abs/2407.10399v1

Compressor summary: The study shows that Moiré patterns significantly reduce the accuracy of deepfake detectors when used on camera-captured videos from digital screens.


Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering

http://arxiv.org/abs/2407.10389v1

Compressor summary: The study proposes a framework that enhances NeRF rendering quality without increasing computational complexity by using a mixture of experts with varying resolutions and a novel gate formulation.


By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting

http://arxiv.org/abs/2407.10385v1

Compressor summary: The text proposes a visual prompting method using multimodal language models to improve the performance and efficiency of sensor data analysis for different sensory tasks.


NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

http://arxiv.org/abs/2407.10380v1

Compressor summary: NTSEBench is a new dataset for evaluating large language and vision models on complex cognitive multi-modal reasoning tasks based on questions from an Indian examination.


An Empirical Study of Mamba-based Pedestrian Attribute Recognition

http://arxiv.org/abs/2407.10374v1

Compressor summary: The paper adapts Mamba, a computationally efficient state space model, to pedestrian attribute recognition in two frameworks and tests its performance with hybrid Mamba-Transformer variants.


Accessing Vision Foundation Models at ImageNet-level Costs

http://arxiv.org/abs/2407.10366v1

Compressor summary: Proteus is a simple method to distill large vision foundation models into smaller equivalents without the original training data, achieving comparable or better performance than other models.