This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-03, generated by the compressor, my personal LLM-based project.
http://arxiv.org/abs/2407.02490v1
Compressor summary: MInference is a sparse computation method for efficient inference of long-context LLMs that reduces latency by up to 10x while maintaining accuracy.
http://arxiv.org/abs/2407.02489v1
Compressor summary: Magic Insert is a technique that lets users insert realistic objects from one image into another image with a different style by fine-tuning a text-to-image model, infusing it with the target style, and adapting an object insertion model to diverse artistic styles.
http://arxiv.org/abs/2407.02486v1
Compressor summary: Neurocache extends the context of large language models using a cache to store past states, improving inference speed and accuracy in various tasks.
http://arxiv.org/abs/2407.02485v1
Compressor summary: The paper proposes RankRAG, a framework that fine-tunes large language models for ranking contexts and generating answers in retrieval-augmented generation tasks, achieving better performance than existing models with less data.
http://arxiv.org/abs/2407.02483v1
Compressor summary: The paper presents MMedAgent, an AI agent for the medical domain that selects appropriate specialized models as tools based on user inputs and outperforms existing methods.
http://arxiv.org/abs/2407.02482v1
Compressor summary: The paper introduces a new method, Rich-contextual Conditional Diffusion Models (RCDMs), that improves story generation by using semantic and temporal context from known clips and reference images.
http://arxiv.org/abs/2407.02477v1
Compressor summary: The paper analyzes preference alignment methods for Multimodal Large Language Models (MLLMs), compares offline and online approaches, introduces a new dataset creation method called Bias-Driven Hallucination Sampling (BDHS), and shows its competitive performance.
http://arxiv.org/abs/2407.02476v1
Compressor summary: The paper proposes a method to speed up computations in a model that uses latent variables to capture covariance between multiple outputs.
http://arxiv.org/abs/2407.02474v1
Compressor summary: The paper proposes a two-dimensional model of emotion based on free energy, valence, and arousal, and demonstrates its application in simulating agents' emotions during a search task.
http://arxiv.org/abs/2407.02472v1
Compressor summary: ValueScope is a framework that uses language models to analyze and compare social norms across different online communities, revealing their diversity and evolution.
http://arxiv.org/abs/2407.02466v1
Compressor summary: Policy Learning with large World Models (PWM) is a new model-based RL algorithm that learns continuous control policies from large multi-task world models, efficiently solving complex problems with large action spaces and many tasks without the need for online planning.
http://arxiv.org/abs/2407.02465v1
Compressor summary: The paper explores how agents can communicate their beliefs more effectively to avoid echo chambers and self-doubt in collaborative tasks.
http://arxiv.org/abs/2407.02455v1
Compressor summary: SUPER is a framework that uses dual-mmWave radars to estimate seated upper body human poses and outperforms existing methods by a large margin.
http://arxiv.org/abs/2407.02448v1
Compressor summary: The study proposes a new method for detecting hate speech in Arabic tweets using ensemble learning and semi-supervised learning, which improves accuracy over existing approaches.
http://arxiv.org/abs/2407.02447v1
Compressor summary: The paper proposes a new algorithm, PLeaS, that can merge different machine learning models without sharing data or base architecture, improving performance by 8 to 15 percentage points for certain tasks.
http://arxiv.org/abs/2407.02446v1
Compressor summary: RLHF models excel at text generation but struggle with world modeling due to their reliance on implicit blueprints for coherence.
http://arxiv.org/abs/2407.02445v1
Compressor summary: AssetGen is a text-to-3D generation system that produces high-quality meshes with realistic textures and materials from few views, achieving results preferred in human evaluations.
http://arxiv.org/abs/2407.02439v1
Compressor summary: The paper presents a model that predicts how people pay attention to graphic design documents, considering both the spatial and temporal aspects of visual fixation using deep learning techniques.
http://arxiv.org/abs/2407.02437v1
Compressor summary: The paper proposes Parameter Matching Attack (PMA), a new availability attack for machine learning models that can degrade their performance even when only partially perturbed data is used.
http://arxiv.org/abs/2407.02432v1
Compressor summary: The text discusses the challenges of evaluating adverse drug event (ADE) detection models on social media using hand-crafted templates for four capabilities.
http://arxiv.org/abs/2407.02431v1
Compressor summary: Graph reduction methods' effectiveness in mitigating backdoor attacks on GNNs varies significantly and some even worsen the situation, raising concerns about security trade-offs in scalable GNN training.
http://arxiv.org/abs/2407.02430v1
Compressor summary: Meta 3D TextureGen is a fast and high-quality method for generating consistent textures on complex 3D objects using text-to-image networks and 3D semantics.
http://arxiv.org/abs/2407.02425v1
Compressor summary: The text is a systematic review of how reinforcement learning can help achieve ethical behavior in autonomous systems.
http://arxiv.org/abs/2407.02424v1
Compressor summary: The authors propose a graphical language for designing and unifying machine learning tasks, and introduce "manipulators", a novel task that converts classifiers into generative models without custom architectures or adversarial training.
http://arxiv.org/abs/2407.02423v1
Compressor summary: The paper presents a new way to visualize and compare machine learning models using diagrams, and applies it to study different types of attention mechanisms in depth.
http://arxiv.org/abs/2407.02422v1
Compressor summary: The paper proposes CliqueMining, a novel mining strategy that improves Visual Place Recognition by selecting examples from visually similar image cliques, boosting recall@1 on two benchmarks.
http://arxiv.org/abs/2407.02411v1
Compressor summary: The paper proposes Video Watermarking, a technique to protect videos from unauthorized annotations by video-based LLMs, by embedding imperceptible watermarks into key frames and preserving the viewing experience.
http://arxiv.org/abs/2407.02408v1
Compressor summary: The authors propose CEB, a benchmark for evaluating various types of biases in large language models across different social groups and tasks, using a compositional taxonomy.
http://arxiv.org/abs/2407.02403v1
Compressor summary: The paper proposes a new method to reconstruct face images that can fool face recognition systems on unseen encoders by using Averaged Latent Search and Unsupervised Validation with pseudo target (ALSUV).
http://arxiv.org/abs/2407.02398v1
Compressor summary: Consistency-FM is a novel method that improves flow matching by enforcing self-consistency in the velocity field and using multi-segment training, resulting in faster convergence and better sample quality.
http://arxiv.org/abs/2407.02397v1
Compressor summary: The authors propose a refinement-with-feedback approach that separates identifying bad generations, generating feedback, and refining with that feedback in large language models, improving factual consistency in document-grounded summaries.
http://arxiv.org/abs/2407.02394v1
Compressor summary: The paper proposes a label assignment strategy called SimD for tiny object detection that considers location and shape similarity between bounding boxes and adapts to different datasets and object sizes.
http://arxiv.org/abs/2407.02392v1
Compressor summary: The proposed visual projector uses a coarse-to-fine scheme to generate condensed visual tokens for MLLMs, improving efficiency and reasoning capabilities.
http://arxiv.org/abs/2407.02389v1
Compressor summary: SafaRi is a weakly-supervised bootstrapping architecture for Referring Expression Segmentation that uses fewer annotations, improves image-text alignment, and performs well in unseen scenarios.
http://arxiv.org/abs/2407.02387v1
Compressor summary: The authors release a real hyperspectral image dataset to improve fusion algorithm development and comparison, as existing simulated datasets have inaccuracies.
http://arxiv.org/abs/2407.02386v1
Compressor summary: The paper introduces OpenSlot, a framework for open-set recognition that handles multiple class semantics and reduces noise, achieving state-of-the-art performance on conventional and mixed tasks.
http://arxiv.org/abs/2407.02371v1
Compressor summary: The authors introduce a new high-quality dataset (OpenVid-1M) for text-to-video generation, along with a novel transformer model (MVDiT) that leverages both visual and textual information.
http://arxiv.org/abs/2407.02370v1
Compressor summary: This paper explores using event-based cameras to create affordable slow-motion sports videos with deep learning techniques.
http://arxiv.org/abs/2407.02369v1
Compressor summary: Two-step Q-learning is a novel off-policy algorithm that converges almost surely to optimal Q-values, outperforming existing methods on benchmark problems.
http://arxiv.org/abs/2407.02361v1
Compressor summary: GCF is a novel approach that uses Graph Convolutional Networks to improve Facial Expression Recognition by enhancing local CNN features with global features, achieving significant performance improvements over state-of-the-art methods on benchmark datasets.
http://arxiv.org/abs/2407.02354v1
Compressor summary: The dissertation covers the author's research on dialogue systems, from modular architectures to end-to-end deep neural networks, and presents contributions to task-oriented dialogues, conversational QA, and large language models for multimodal dialogue.
http://arxiv.org/abs/2407.02352v1
Compressor summary: Pelican is a framework that detects and mitigates hallucinations in large visual language models by decomposing claims into sub-claims, generating Python code for answering questions, and verifying the correctness of the claim using reasoning abilities.
http://arxiv.org/abs/2407.02351v1
Compressor summary: The paper explores how large language models can help fact-checkers identify false information online by using their knowledge and reasoning skills.
http://arxiv.org/abs/2407.02350v1
Compressor summary: The paper introduces CoCoLe, a method to improve vision-language models' generalization by learning a codebook of visual concepts linked to text encoder inputs for few-shot classification tasks.
http://arxiv.org/abs/2407.02348v1
Compressor summary: The paper proposes a simple adaptive inference scheme called cascade of ensembles (CoE) that uses ensemble agreement to route examples through different models, achieving efficiency gains and reducing costs.
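The routing idea in this summary is simple enough to sketch. Below is a minimal, hypothetical illustration of agreement-based cascading (not the paper's actual CoE code): a cheap ensemble answers when its members agree, and disagreement escalates the example to a stronger model; `small_models`, `large_model`, and the threshold are all placeholders.

```python
# Minimal sketch of agreement-based cascade routing; all names are illustrative.
from collections import Counter

def cascade_predict(x, small_models, large_model, agreement=1.0):
    """Answer with the cheap ensemble when its members agree,
    otherwise escalate the example to the expensive model."""
    votes = [m(x) for m in small_models]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= agreement:
        return label           # ensemble agrees: stop early and save compute
    return large_model(x)      # disagreement: route to the stronger model
```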
http://arxiv.org/abs/2407.02345v1
Compressor summary: MORPHEUS is a novel framework for generating personalized dialogues that uses a persona codebook to represent roles in latent space and improve response generation without external role data.
http://arxiv.org/abs/2407.02340v1
Compressor summary: The study proposes RVISA, a two-stage reasoning framework that combines generation and reasoning abilities of LLMs to identify implicit sentiment using three-hop reasoning prompting and a verification mechanism.
http://arxiv.org/abs/2407.02337v1
Compressor summary: This paper presents new resources and benchmarks to advance open-source foundation models for Azerbaijani language understanding and generation.
http://arxiv.org/abs/2407.02335v1
Compressor summary: CALICO is an active learning framework that self-calibrates confidence for sample selection in deep neural networks, improving classification performance with limited labeled data.
http://arxiv.org/abs/2407.02333v1
Compressor summary: The paper investigates a multilingual bias in vision-language models and suggests that switching the language backbone and intervening on attention layers can reduce it.
http://arxiv.org/abs/2407.02329v1
Compressor summary: The Multi-Instance Generation (MIG) task involves generating multiple instances in an image with specific attributes, and the proposed methods MIGC, MIGC++, and Consistent-MIG improve control, diversity, and consistency in this task.
http://arxiv.org/abs/2407.02328v1
Compressor summary: The paper proposes a method to improve the efficiency and scalability of large language models by adaptively sparsifying attention and rebuilding discarded tokens as needed, achieving significant throughput improvement in natural language generation tasks.
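For readers unfamiliar with attention sparsification, here is a generic top-k variant that conveys the flavor of the idea; the paper's adaptive policy and token-rebuilding step are not reproduced, and `keep` is an illustrative parameter:

```python
# Generic top-k attention sparsification sketch; not the paper's method.
import torch

def topk_sparse_attention(q, k, v, keep=64):
    """Keep only the `keep` highest-scoring keys per query before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if scores.shape[-1] > keep:
        thresh = scores.topk(keep, dim=-1).values[..., -1:]  # k-th largest score
        scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```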
http://arxiv.org/abs/2407.02327v1
Compressor summary: QSync is a system that enables efficient DNN training on hybrid devices by selecting optimal quantized operators based on device resource capacities and synchronizing workers with minimized accuracy degradation.
http://arxiv.org/abs/2407.02322v1
Compressor summary: The text studies SGD dynamics for the least-square problem using SDEs in different settings and provides convergence rates, stationary distribution properties, and numerical simulations.
http://arxiv.org/abs/2407.02320v1
Compressor summary: Transliteration can improve low-resource language performance in LLMs across various tasks, but its effectiveness depends on the task and model size.
http://arxiv.org/abs/2407.02317v1
Compressor summary: The study explores how to improve cross-lingual NLP applications by using fine-tuning methods along with language-specific and task-specific adapters and soft prompts, finding that combining a soft language prompt with a task adapter often works best.
http://arxiv.org/abs/2407.02315v1
Compressor summary: VFIMamba is a novel frame interpolation method that uses Selective State Space Models (S6) to efficiently model intermediate frames in videos, achieving state-of-the-art performance in high-resolution scenarios.
http://arxiv.org/abs/2407.02310v1
Compressor summary: Large language models can perform well on semantics-aware process mining tasks after fine-tuning, while struggling without it.
http://arxiv.org/abs/2407.02309v1
Compressor summary: S-GEAR is a novel framework that learns action representations considering their semantic interconnectedness and improves action anticipation performance on several benchmarks.
http://arxiv.org/abs/2407.02302v1
Compressor summary: The study evaluates ChatGPT's ability to generate English paraphrases using different linguistic changes and introduces APTY, a dataset for improving language models.
http://arxiv.org/abs/2407.02301v1
Compressor summary: CFinBench is a benchmark to test Chinese LLMs' financial knowledge on various topics, tasks, and certifications with 99,100 questions.
http://arxiv.org/abs/2407.02286v1
Compressor summary: The paper proposes Selective Jittering and Learnable Point Drop data augmentation techniques to improve LiDAR semantic segmentation performance in adverse weather conditions by addressing geometric perturbation and point drop issues.
http://arxiv.org/abs/2407.02284v1
Compressor summary: Renard is a Python library for creating custom NLP pipelines to analyze dynamic and static networks of characters in narrative texts.
http://arxiv.org/abs/2407.02283v1
Compressor summary: ReSFU is a novel feature upsampling framework for image segmentation that addresses the limitations of the existing similarity-based upsampling pipeline with improved feature alignment, flexible similarity calculation, and fine-grained neighbor selection, and works well across different architectures and segmentation applications.
http://arxiv.org/abs/2407.02279v1
Compressor summary: Boosting can optimize any loss function without needing first-order information or smoothness conditions, using tools from quantum calculus.
http://arxiv.org/abs/2407.02275v1
Compressor summary: This paper reviews various learning paradigms and methodologies used for creating digital twins in the process industry, and identifies challenges and future directions.
http://arxiv.org/abs/2407.02273v1
Compressor summary: The study explores how large language models make moral decisions across various languages and cultures, finding that their alignment with human preferences varies depending on the language.
http://arxiv.org/abs/2407.02272v1
Compressor summary: The authors propose a new method to generate more realistic human motions by introducing a large dataset and a model that captures human preferences, which can be integrated into the generation pipeline.
http://arxiv.org/abs/2407.02271v1
Compressor summary: The paper proposes a method to improve explainability and uncertainty estimation in softmax classifiers using prototype-based predictions and similarity learning.
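The prototype idea is a standard construction worth a quick sketch: logits become negative distances to learned class prototypes, so softmax confidence acquires a geometric reading (how close an input is to each class). This is a generic illustration, not the paper's exact model:

```python
# Generic prototype-based classification head; illustrative, not the paper's model.
import torch
import torch.nn as nn

class PrototypeHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats):
        # Squared Euclidean distance from each feature to each class prototype.
        dist = torch.cdist(feats, self.prototypes) ** 2
        return -dist  # negative distances act as logits for softmax / cross-entropy
```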
http://arxiv.org/abs/2407.02265v1
Compressor summary: DrugCLIP is a contrastive learning method that automatically discovers new uses for existing drugs by analyzing drug and disease interactions in large datasets.
http://arxiv.org/abs/2407.02264v1
Compressor summary: The paper presents SOAF, a new approach for novel-view audio-visual synthesis in indoor scenes that models sound propagation and learns scene transmittance, generating binaural audio with directional attention via Fibonacci Sphere feature extraction and outperforming previous techniques on real and synthetic datasets.
http://arxiv.org/abs/2407.02263v1
Compressor summary: The FreeCG method improves the Clebsch-Gordan Transform layer by using permutation-invariant abstract edges, group CG transform, sparse paths, and attention enhancement to achieve better force and property predictions for molecular datasets.
http://arxiv.org/abs/2407.02258v1
Compressor summary: SiamTST is a new framework for learning from multivariate time series data using a Siamese transformer architecture and training refinements that improve accuracy.
http://arxiv.org/abs/2407.02253v1
Compressor summary: The paper introduces a new method, Parameter-Selective Mean Teacher (PSMT), that updates only critical parameters in the Mean Teacher model to avoid error accumulation and catastrophic forgetting when adapting to changing environments.
http://arxiv.org/abs/2407.02252v1
Compressor summary: The authors propose an end-to-end text rendering framework for poster generation using a triple cross-attention mechanism and a high-resolution dataset, aiming to create precise and contextually rich poster images.
http://arxiv.org/abs/2407.02248v1
Compressor summary: The study proposes EvolBA, an adversarial attack method that uses CMA-ES under the hard-label black-box (HL-BB) condition to find adversarial examples (AEs) with smaller perturbations in images.
http://arxiv.org/abs/2407.02243v1
Compressor summary: RIO uses reinforcement learning from human feedback to select exemplars that improve the robustness and quality of zero-shot text-to-speech systems by leveraging reverse inference based on the Bayesian principle.
http://arxiv.org/abs/2407.02241v1
Compressor summary: The paper proposes a sign language recognition network that uses hand skeleton features and facial expressions to improve accuracy and robustness.
http://arxiv.org/abs/2407.02235v1
Compressor summary: The text describes a study that developed a large language model called BrainGPT to generate accurate and informative 3D brain CT reports by addressing data complexity, model capacity, and evaluation metric issues, and demonstrated its clinical readiness through physician evaluations and a proposed FORTE metric.
http://arxiv.org/abs/2407.02233v1
Compressor summary: SMMQG is a framework for generating synthetic question-answer pairs from multimodal documents to evaluate MMRAG models, achieving high quality comparable to existing benchmarks.
http://arxiv.org/abs/2407.02229v1
Compressor summary: LaMoD is a novel deep learning model that predicts accurate DENSE motions from standard CMR videos for improved myocardial strain analysis in cardiac patients.
http://arxiv.org/abs/2407.02228v1
Compressor summary: MTMamba is a new architecture for multi-task scene understanding that leverages Mamba to handle long-range dependency and model cross-task interactions, outperforming previous methods on NYUDv2 and PASCAL-Context datasets.
http://arxiv.org/abs/2407.02222v1
Compressor summary: The study evaluates an eye blink feature set to detect driver fatigue using camera-based solutions, which are non-intrusive and adapt to different drivers.
http://arxiv.org/abs/2407.02218v1
Compressor summary: MST-MIXER is a novel video dialog model that tracks multiple modalities and learns local latent graphs to improve performance on real-world scenarios.
http://arxiv.org/abs/2407.02217v1
Compressor summary: This paper shows how using partial physical knowledge can improve reinforcement learning by enhancing sample efficiency, inference speed, and planning in real-world applications.
http://arxiv.org/abs/2407.02211v1
Compressor summary: PromptIntern is a novel method that helps large language models learn prompt knowledge internally, reducing inference costs and increasing speed for complex natural language processing tasks.
http://arxiv.org/abs/2407.02209v1
Compressor summary: Generative monoculture is a phenomenon in large language models where they produce less diverse outputs than expected for certain tasks, which can have positive or negative consequences depending on the use case and requires better alignment methods to mitigate.
http://arxiv.org/abs/2407.02208v1
Compressor summary: The paper proposes a self-correction method to improve machine translation performance by using the model's prediction distribution to revise training supervision in the presence of semantic misalignment noise.
http://arxiv.org/abs/2407.02203v1
Compressor summary: The paper proposes using large language models to optimize rule-based self-adaptation systems by leveraging their common sense and reasoning abilities.
http://arxiv.org/abs/2407.02197v1
Compressor summary: The study proposes a method to improve autonomous vehicle performance in complex indoor environments like underground parking lots using CARLA's simulation platform and an occupancy grid network.
http://arxiv.org/abs/2407.02191v1
Compressor summary: The paper proposes methods to directly calibrate noise scale in differential privacy (DP) models based on attack risk, improving utility without sacrificing privacy.
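For context, the baseline such risk-based calibration refines is the classic (epsilon, delta) Gaussian-mechanism bound; the helper below implements that textbook formula, not the paper's attack-risk calibration:

```python
# Classic Gaussian-mechanism noise calibration (Dwork & Roth bound); the paper
# calibrates noise from attack risk instead, which is not reproduced here.
import math

def gaussian_sigma(sensitivity, epsilon, delta):
    # sigma >= sqrt(2 ln(1.25/delta)) * sensitivity / epsilon suffices
    # for (epsilon, delta)-differential privacy with Gaussian noise.
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon
```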
http://arxiv.org/abs/2407.02188v1
Compressor summary: SACN is a novel graph node classification method that leverages structure-aware consensus learning between two augmented views, integrates structural information, and achieves strong performance especially at low label rates.
http://arxiv.org/abs/2407.02187v1
Compressor summary: The paper proposes a deep learning method using point-prompts for class-agnostic cell segmentation in the in vitro scratch assay, reducing subjectivity and increasing accuracy.
http://arxiv.org/abs/2407.02182v1
Compressor summary: The authors introduce OASS, a new task that addresses challenges in panoramic image segmentation, present the BlendPASS dataset, and propose UnmaskFormer, a solution that uses Unmasking Attention and Amodal-oriented Mix to achieve state-of-the-art results.
http://arxiv.org/abs/2407.02174v1
Compressor summary: The paper proposes a method to recover neural radiance fields (NeRF) from a single blurry image and its camera motion, enabling view-consistent sharp images and high-quality rendering.
http://arxiv.org/abs/2407.02165v1
Compressor summary: WildAvatar is a large dataset for creating 3D human avatars from YouTube videos, addressing the limitations of existing datasets and enabling real-world applications.
http://arxiv.org/abs/2407.02158v1
Compressor summary: UltraPixel is a novel architecture that efficiently generates high-quality images at multiple resolutions using cascade diffusion models, implicit neural representations, and scale-aware normalization layers.
http://arxiv.org/abs/2407.02157v1
Compressor summary: FineCLIPER is a novel framework that uses Multi-modal Fine-grained CLIP to recognize dynamic facial expressions with improved accuracy and adaptability by extending class labels, using hierarchical cues, and adopting Parameter-Efficient Fine-Tuning.
http://arxiv.org/abs/2407.02153v1
Compressor summary: The paper compares one-dimensional function approximation using shallow neural networks with ReLU activation and traditional methods like Free Knot Splines, and proposes a two-level training method for better performance.
http://arxiv.org/abs/2407.02150v1
Compressor summary: The VRBiom dataset contains periocular videos acquired using a VR headset for biometric applications, including iris and periocular recognition, with real and spoofed data.
http://arxiv.org/abs/2407.02147v1
Compressor summary: The authors introduce InstAr-500k, a new Arabic instruction dataset, which improves the performance of language models on various Arabic NLP tasks by fine-tuning existing models.
http://arxiv.org/abs/2407.02143v1
Compressor summary: CAGAD enhances anomaly detection in graphs by creating counterfactual node representations using a graph pointer neural network and a diffusion model.
http://arxiv.org/abs/2407.02138v1
Compressor summary: The paper proposes a new uncertainty estimation method for DNNs using k-Nearest Neighbor (kNN) that performs well in calibration, selective prediction, and out-of-distribution detection with low inference cost.
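The kNN approach lends itself to a compact sketch: score a test input by its distance to the training set's features, with larger distances read as higher uncertainty. The scoring rule below is a generic illustration, not the paper's exact estimator:

```python
# Generic kNN-distance uncertainty sketch over penultimate-layer features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_knn_scorer(train_feats, k=10):
    return NearestNeighbors(n_neighbors=k).fit(train_feats)

def uncertainty(scorer, test_feats):
    # Mean distance to the k nearest training features: far from the
    # training manifold implies a less trustworthy prediction.
    dists, _ = scorer.kneighbors(test_feats)
    return dists.mean(axis=1)
```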
http://arxiv.org/abs/2407.02136v1
Compressor summary: The paper investigates how language models (LMs) learn adjective order preferences (AOPs), the complex ordering patterns of multiple adjectives in noun phrases that cross syntax, semantics, and pragmatics; using a reusable corpus of adjective pairs and newly defined AOP measures, the authors find that LMs predict orderings better than theoretical linguistic factors but show strong data-frequency effects and limited generalization, and they discuss key open questions about LM knowledge and generalization.
http://arxiv.org/abs/2407.02123v1
Compressor summary: HFCR-Net combines channel features and spatial features to improve few-shot fine-grained image classification by enhancing inter-class differences and reducing intra-class differences through a hybrid feature reconstruction process.
http://arxiv.org/abs/2407.02122v1
Compressor summary: The text is a survey about fake news detection datasets, emphasizing their importance for model performance, and providing a GitHub repository with publicly accessible datasets.
http://arxiv.org/abs/2407.02119v1
Compressor summary: RLHF relies on human feedback, but it's limited by the size of preference data; our approach uses online methods and proxy reward oracles to efficiently label preferences with minimal expert input.
http://arxiv.org/abs/2407.02118v1
Compressor summary: The paper explores constructing large language models for new languages by continually pretraining from existing models, showing faster convergence, resource savings, and different data-parameter allocation compared to training from scratch.
http://arxiv.org/abs/2407.02112v1
Compressor summary: The paper proposes a data-centric evaluation framework for tabular data models, showing that dataset-specific preprocessing and feature engineering are crucial factors affecting performance, and highlights the importance of test-time adaptation for dynamic data.
http://arxiv.org/abs/2407.02109v1
Compressor summary: HRSAM is a new interactive segmentation model that combines Flash Attention and PSCWin attention to handle high-resolution images and achieve low latency, outperforming previous models.
http://arxiv.org/abs/2407.02106v1
Compressor summary: The paper presents a framework for creating knowledge graphs from time series data to help with industrial decision-making and optimization, using Granger causality to find key attributes for predictive models.
http://arxiv.org/abs/2407.02104v1
Compressor summary: The paper proposes joint-dataset learning and CCCL, which use more data and impose uni-modal constraints, together with MoT++, a transformer-based encoder that applies spatio-temporal attention to the 3D skeleton sequences extracted by pose estimation, to improve text-motion retrieval (searching for motions or descriptions that match a given text), with evaluations on the KIT Motion-Language and HumanML3D datasets.
http://arxiv.org/abs/2407.02099v1
Compressor summary: The paper explores how assigning different personas to large language models affects their behavior and compares it to a control setting with a generic "helpful assistant" and no persona.
http://arxiv.org/abs/2407.02098v1
Compressor summary: The paper proposes a post-training weight pruning scheme for 3D object detection that reduces computational cost and memory footprint while maintaining or enhancing detection precision.
http://arxiv.org/abs/2407.02091v1
Compressor summary: This paper studies how different binary labeling methods affect the performance of solving large-scale optimization problems using factorization machines with annealing, and proposes a new method called Gray labeling that improves convergence speed and accuracy for the traveling salesman problem.
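Gray labeling is easy to demonstrate: under the Gray code, adjacent integers differ in exactly one bit, unlike standard binary. The snippet below illustrates the labeling choice being studied, not the paper's full pipeline:

```python
# Standard binary vs. Gray labeling of integers; adjacent integers differ
# by a single bit under the Gray code (illustrative of the labeling choice).
def binary_label(n, bits):
    return [(n >> i) & 1 for i in reversed(range(bits))]

def gray_label(n, bits):
    return binary_label(n ^ (n >> 1), bits)  # classic binary-to-Gray conversion

# binary: 3 -> [0,1,1], 4 -> [1,0,0] (all three bits flip)
# gray:   3 -> [0,1,0], 4 -> [1,1,0] (only one bit flips)
```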
http://arxiv.org/abs/2407.02089v1
Compressor summary: GPTCast is a generative deep-learning method that uses a large language model to learn spatiotemporal precipitation dynamics from radar images and produce realistic ensemble forecasts with accurate uncertainty estimation.
http://arxiv.org/abs/2407.02077v1
Compressor summary: HTCL is a novel method for improving camera-based semantic scene completion by learning hierarchical temporal context and adaptively refining feature sampling locations, achieving state-of-the-art results on benchmarks.
http://arxiv.org/abs/2407.02075v1
Compressor summary: Label Anything is a neural network architecture for few-shot semantic segmentation that uses different visual prompts and trains end-to-end across multi-class scenarios, improving adaptability and generalization.
http://arxiv.org/abs/2407.02070v1
Compressor summary: The authors propose a new deep learning method to quickly and efficiently generate multiple climate scenarios for uncertainty analysis.
http://arxiv.org/abs/2407.02068v1
Compressor summary: The paper introduces a block-structured pruning method for vision transformers (ViTs) that exploits the block-wise structure of linear layers to reduce resource requirements, optimizing the pruning scheme with a hardware-aware learning objective that balances accuracy and power consumption, and achieves competitive ImageNet performance with significant speedup and power reduction.
http://arxiv.org/abs/2407.02067v1
Compressor summary: The study examines large multimodal models' ability to recognize and adapt across different cultural contexts using a new dataset, Dalle Street.
http://arxiv.org/abs/2407.02066v1
Compressor summary: The paper studies how Vision Language Models can have implicit social biases across nine dimensions and creates a dataset to identify and mitigate them.
http://arxiv.org/abs/2407.02062v1
Compressor summary: Data augmentation enhances confidence calibration and uncertainty estimation in Named Entity Recognition tasks using Deep Neural Networks.
http://arxiv.org/abs/2407.02060v1
Compressor summary: The paper introduces a new neuro-symbolic model that combines transformers and Tensor Product Representations, improves it by reducing parameters and introducing a mixture of experts, and proposes an automatic termination algorithm for controlling the number of steps in the computation.
http://arxiv.org/abs/2407.02057v1
Compressor summary: The paper proposes a new method for detecting anomalies in graphs using hypergraphs, node group connections, and hyperbolic geometry.
http://arxiv.org/abs/2407.02056v1
Compressor summary: Fine-Grained Self-Consistency (FSC) enhances LLMs' performance in both open-ended and reasoning tasks by integrating segment-level commonalities from candidate samples, and introduces two strategies to further improve output quality.
http://arxiv.org/abs/2407.02055v1
Compressor summary: The paper explores how dialectical frameworks and Boolean regulatory networks, two different models from argumentation and biology respectively, have similarities in appearance and can be related to produce new insights.
http://arxiv.org/abs/2407.02047v1
Compressor summary: CountFormer is a 3D multi-view counting framework that handles different camera layouts and achieves superior performance using scene-level volume representation and attention mechanism.
http://arxiv.org/abs/2407.02043v1
Compressor summary: The authors propose two methods to compress tool documentation for language models, reducing input length and decoding time while preserving key information.
http://arxiv.org/abs/2407.02042v1
Compressor summary: The paper introduces manipulation reasoning, a new approach to fake news detection that reasons about manipulations of news content, along with HFFN, a benchmark spanning human-centric and fact-related domains with detailed annotations, and M-DRUM, a multi-modal model that extracts fusion features and performs analytical reasoning about manipulations, outperforming SOTA models and LVLMs such as GPT-4 and LLaVA.
http://arxiv.org/abs/2407.02040v1
Compressor summary: ASD is a text-to-3D method that uses diffusion models to synthesize 3D contents faster and more accurately by adjusting the model's timestep, enabling it to handle large amounts of text prompts.
http://arxiv.org/abs/2407.02039v1
Compressor summary: The authors propose a Prompt Stability Score (PSS) metric and a Python package to measure and improve the reproducibility of language models for text annotation tasks.
http://arxiv.org/abs/2407.02038v1
Compressor summary: The paper proposes CL-Gait, a cross-modality gait recognition framework between cameras and LiDARs, using a two-stream network and contrastive pre-training with virtual data generation.
http://arxiv.org/abs/2407.02034v1
Compressor summary: Our paper proposes a progressive 3D editing strategy with Trajectory-Anchored Scheme and dual-branch editing mechanism to ensure multi-view consistency and improve editing quality in text-guided 3D scene editing.
http://arxiv.org/abs/2407.02030v1
Compressor summary: The text explores a debiasing technique for large language models using the Contact Hypothesis, which involves simulating social contact through prompts and instruction-tuning with unbiased responses to reduce prejudices in LLMs.
http://arxiv.org/abs/2407.02028v1
Compressor summary: The study evaluates in-context learning for open and closed questions of varying difficulty and novelty, revealing a counter-intuitive effect of context relevancy on hard and novel questions.
http://arxiv.org/abs/2407.02025v1
Compressor summary: The paper presents a message-passing neural network architecture called EGENNET that can separate non-equivalent geometric graphs based on local features and rotation equivariance, improving upon previous methods for connected graphs and globally rigid graphs.
http://arxiv.org/abs/2407.02014v1
Compressor summary: The paper proposes a new contrastive learning method that learns multi-grained representations for better generalization on various downstream tasks, outperforming existing methods without using large-scale pretraining data.
http://arxiv.org/abs/2407.02013v1
Compressor summary: DiGRAF is a novel activation function for graph neural networks (GNNs) based on continuous piecewise-affine transformations that learns to adapt to different graphs, has desirable properties such as differentiability, boundedness, and efficiency, and outperforms other activation functions in experiments.
http://arxiv.org/abs/2407.02005v1
Compressor summary: The paper presents a speech summarization (SSum) model that uses Q-Former and LLMs to fuse long audio inputs with text, trains in multiple stages with ASR and text summarization (TSum) as auxiliary tasks, and employs curriculum learning to transition from TSum to SSum, achieving competitive performance on the How-2 dataset.
http://arxiv.org/abs/2407.02004v1
Compressor summary: The study presents SAVE, a lightweight approach that adapts the pre-trained SAM model for audio-visual segmentation by using image and audio encoder adapters to improve fusion and speed, achieving higher performance on real data than previous methods.
http://arxiv.org/abs/2407.01996v1
Compressor summary: Visually Grounded Bias Discovery and Mitigation (ViG-Bias) improves bias detection and reduction in machine learning models by using visual explanations from large vision language models.
http://arxiv.org/abs/2407.01994v1
Compressor summary: This work proposes three techniques to enhance rule sets for Neuro-Symbolic Knowledge Graph Completion models, achieving significant improvements in coverage and performance.
http://arxiv.org/abs/2407.01992v1
Compressor summary: The authors use graph mining to create a contrast set from MCQA datasets and show that large language models do not rely on choices-only shortcuts for high performance in multiple-choice question answering.
http://arxiv.org/abs/2407.01991v1
Compressor summary: The paper proposes a method to find shortest paths on continuous surfaces using recursive midpoint prediction and an actor-critic learning approach, which is theoretically sound and performs better than previous methods in local and global path planning.
http://arxiv.org/abs/2407.01987v1
Compressor summary: AHMsys is a system that automates creating 3D HVAC models from 2D CAD drawings, reducing BIM process time by 20 percent.
http://arxiv.org/abs/2407.01983v1
Compressor summary: The paper introduces SADL, a visual-linguistic prompting framework for in-context learning in Visual QA that samples, decomposes, and pseudo-labels image-question pairs to bridge the semantic gap between symbols and images.
http://arxiv.org/abs/2407.01979v1
Compressor summary: The paper proposes Global Interactive Pattern (GIP) learning, a novel interpretation scheme for graph classification with GNNs that moves beyond subgraph features and local structures by introducing learnable global patterns, using graph clustering and prototype matching to enable transparent graph-level reasoning and improving interpretability on synthetic and real-world data.
http://arxiv.org/abs/2407.01976v1
Compressor summary: LayTextLLM is a new method for document understanding that improves performance in Key Information Extraction and Visual Question Answering by interleaving text and spatial layout embeddings without long sequence issues.
http://arxiv.org/abs/2407.01971v1
Compressor summary: MPV-Net uses diverse refining policies to generate trusted pseudo labels for image cropping models using labeled and unlabeled data, achieving state-of-the-art results.
http://arxiv.org/abs/2407.01967v1
Compressor summary: The paper proposes a novel pretraining paradigm with soft labels and a metric with adaptability to improve few-shot classification using local representations.
http://arxiv.org/abs/2407.01965v1
Compressor summary: The paper proposes a novel framework called AdaCQR that improves conversational search by aligning reformulation models with different types of retrieval systems and using efficient techniques to acquire better labels and input candidates.
http://arxiv.org/abs/2407.01964v1
Compressor summary: The paper proposes a new reasoning framework (ADAPT) to adapt large language models for better legal judgment prediction by understanding case facts, discriminating charges, and predicting judgments.
http://arxiv.org/abs/2407.01960v1
Compressor summary: The paper introduces a framework for zero-shot video restoration and enhancement using a pre-trained image diffusion model with a cross-previous-frame attention layer and other techniques to reduce temporal flickering artifacts and improve quality.
http://arxiv.org/abs/2407.01959v1
Compressor summary: The paper proposes FlowTrack, a point-level flow method for 3D single object tracking that captures local motion details, handles sparse points with a learnable target feature, and aggregates local motion information into global motion using an Instance Flow Head.
http://arxiv.org/abs/2407.01955v1
Compressor summary: The paper proposes a new method to speed up large language models' decoding process by using sorted speculative decoding with multiple draft models for different target models, reducing costs and improving performance.
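The base technique the paper builds on, speculative decoding, is sketched below in its vanilla form: a cheap draft model proposes a few tokens and the target model accepts or rejects them. `draft` and `target` are hypothetical probability functions, and the rejection branch is simplified (the full algorithm resamples from an adjusted target distribution):

```python
# Vanilla speculative decoding sketch; the paper's multi-draft sorting is not shown.
import random

def speculative_step(prefix, draft, target, k=4):
    """Draft k tokens cheaply, then let the target model accept or reject them."""
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok, p_draft = draft(ctx)      # cheap model proposes a token and its prob
        drafted.append((tok, p_draft))
        ctx.append(tok)
    out = list(prefix)
    for tok, p_draft in drafted:
        p_target = target(out, tok)    # target model's prob of the drafted token
        if random.random() < min(1.0, p_target / p_draft):
            out.append(tok)            # accepted: keep the cheaply drafted token
        else:
            break                      # rejected: full method resamples from target
    return out
```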
http://arxiv.org/abs/2407.01948v1
Compressor summary: The paper presents a two-stage framework that extracts factual statements from radiology reports with an LLM-based Fact Extractor and improves text encoders for downstream tasks with a BERT-based Fact Encoder fine-tuned using objective functions, and introduces CXRFEScore, a new metric for evaluating chest X-ray text generation systems.
http://arxiv.org/abs/2407.01945v1
Compressor summary: The paper presents a simple and reliable method for calibrating a camera-projector pair (CPP) using an unknown cuboid corner, enabling direct 3D reconstruction in indoor scenes with weak textures.
http://arxiv.org/abs/2407.01942v1
Compressor summary: The paper proposes a taxonomy of uncertainty in vision-language AI systems, creates a dataset with contrastive examples, and introduces a new metric for measuring calibration error.
http://arxiv.org/abs/2407.01937v1
Compressor summary: The paper introduces Efficient-Empathy, an algorithm that selects high-quality empathetic data to improve large language models' performance and efficiency in empathetic dialogues.
http://arxiv.org/abs/2407.01931v1
Compressor summary: SPI-CorrNet is a deep learning model that predicts 3D correspondences from sparse imaging data, improving the accuracy and robustness of statistical shape modeling in clinical research.
http://arxiv.org/abs/2407.01930v1
Compressor summary: The paper proposes a Self-Cooperation Knowledge Distillation (SCKD) method to balance reviewing known classes and discovering novel classes in unsupervised learning.
http://arxiv.org/abs/2407.01929v1
Compressor summary: The paper examines the evolution of the term "Language Models" in scientific discourse and calls for a new perspective on how systems and theories influence each other.
http://arxiv.org/abs/2407.01928v1
Compressor summary: SymPoint-V2 is a new, improved method for recognizing symbols in CAD drawings that uses layer information and faster training to outperform its predecessor SymPoint.
http://arxiv.org/abs/2407.01925v1
Compressor summary: The paper introduces Looking From the Future (LFF), a novel optimization concept for adversarial attacks that improves generalization from different perspectives and transfers better to unseen tasks, extends it to a multi-order attack method, $LFF^{\mathcal{N}}$, and shows significantly enhanced attack transferability on the ImageNet1k dataset.
http://arxiv.org/abs/2407.01921v1
Compressor summary: GVDIFF is a text-to-video framework that uses uncertainty-based representations, spatial-temporal grounding, and dynamic gates to generate videos guided by text with different applications.
http://arxiv.org/abs/2407.01920v1
Compressor summary: MemFlex is a knowledge-unlearning method that uses gradient information to target the parameters holding sensitive data, erasing specific knowledge from large language models (LLMs) more precisely than current methods while retaining essential knowledge.
http://arxiv.org/abs/2407.01916v1
Compressor summary: The text discusses ranking manipulation using pairwise comparisons and proposes distributionally robust methods to attack and defend against such manipulations.
http://arxiv.org/abs/2407.01911v1
Compressor summary: The authors developed a method to convert single-channel speech recordings into pseudo-stereo data, which increased the training dataset size and improved the performance of spoken dialogue language models.
http://arxiv.org/abs/2407.01910v1
Compressor summary: The authors propose a Multi-Grained-Verilog dataset built with explicit quality criteria, together with a balanced fine-tuning scheme that leverages the dataset's diversity, to enhance LLMs in hardware design tasks and address the limitations of existing hardware datasets.
http://arxiv.org/abs/2407.01909v1
Compressor summary: The paper introduces a new Chinese ASR error correction dataset and shows that Pinyin regularization improves the performance of large language models in this task.
http://arxiv.org/abs/2407.01907v1
Compressor summary: The paper presents a new method for answering questions about videos that uses two steps: first, it uses VALOR to answer questions based on video information, and second, it uses TubeDETR to find bounding boxes of objects in the video.
http://arxiv.org/abs/2407.01906v1
Compressor summary: The paper explores parameter-efficient fine-tuning for sparse-architecture large language models with a Mixture-of-Experts architecture, proposing Expert-Specialized Fine-Tuning that improves tuning efficiency and performance.
http://arxiv.org/abs/2407.01905v1
Compressor summary: The paper proposes a diffusion model and a transformer combined approach for multi-class anomaly detection in industry, which improves accuracy and avoids common problems such as blurry reconstruction and identical shortcuts.
http://arxiv.org/abs/2407.01903v1
Compressor summary: TADPoLe is a method that uses pretrained generative models to learn policies from natural language without explicit rewards or demonstrations, achieving natural and diverse behaviors in various robotic domains.
http://arxiv.org/abs/2407.01899v1
Compressor summary: The AMS parser is a neurosymbolic semantic parser for DRT that excels at handling complex sentences by predicting quantifier scope.
http://arxiv.org/abs/2407.01897v1
Compressor summary: The paper presents a document summarization method that uses auxiliary information to efficiently summarize content related to described objects in texts, achieving top scores in two tracks of the SciCAP competition.
http://arxiv.org/abs/2407.01896v1
Compressor summary: LogEval is a benchmark suite to evaluate how well large language models perform in various log analysis tasks for AIOps.
http://arxiv.org/abs/2407.01894v1
Compressor summary: The paper presents a brain-eye-computer system for detecting dim targets in aerial images using EEG and computer vision, and proposes an adaptive modality balanced online knowledge distillation (AMBOKD) method to fuse EEG and image features effectively.
http://arxiv.org/abs/2407.01892v1
Compressor summary: The paper introduces GRASP, a large-scale benchmark for evaluating commonsense spatial reasoning in language models, and shows that current advanced LLMs perform poorly on it.
http://arxiv.org/abs/2407.01887v1
Compressor summary: The paper evaluates LLMs' decision-making abilities in Dueling Bandits, compares their performance to existing algorithms, and proposes an augmented algorithm that combines LLMs' strengths with classic DB guarantees.
http://arxiv.org/abs/2407.01886v1
Compressor summary: The paper proposes a novel algorithm that learns the core subgraph of a graph to improve adaptability, scalability, and generalizability in graph classification tasks.
http://arxiv.org/abs/2407.01885v1
Compressor summary: The paper surveys knowledge distillation methods for compressing large language models (LLMs) while preserving accuracy, along with evaluation tasks and applications of the distilled models.
http://arxiv.org/abs/2407.01884v1
Compressor summary: The authors propose a large multi-modal EEG dataset, EIT-1M, with over 1 million pairs of images, texts, and brain activity recordings to study how the brain processes multiple modalities simultaneously.
http://arxiv.org/abs/2407.01878v1
Compressor summary: Separability is a meta-evaluation measure that estimates how suitable a test instance is for pairwise preference evaluation by measuring the distinguishability of multiple generations from a model pair.
http://arxiv.org/abs/2407.01875v1
Compressor summary: This paper reviews various counterfactual models for AI and proposes a unified graphical approach to handle spatial and temporal interactions.
http://arxiv.org/abs/2407.01873v1
Compressor summary: The study explores using open-source, small-scale generative language models for automated text scoring and feedback generation on modest hardware.
http://arxiv.org/abs/2407.01872v1
Compressor summary: The paper introduces RAVAR, a new task to recognize atomic actions of a specific person based on text and video, and presents RefAtomNet, a novel method that uses cross-stream attention to address the challenges of this task.
http://arxiv.org/abs/2407.01866v1
Compressor summary: Image-GS is a novel image representation using anisotropic 2D Gaussians that enables content-adaptive rendering with high memory efficiency, fast random access, and a natural level of detail stack for various applications.
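The primitive behind such a representation is small enough to show: an image is approximated as a sum of weighted anisotropic 2D Gaussians evaluated at pixel coordinates. The snippet below is an illustrative evaluation of one Gaussian, not the paper's renderer:

```python
# Evaluate one anisotropic 2D Gaussian at pixel coordinates; an Image-GS-style
# representation sums many such weighted primitives (illustrative sketch).
import numpy as np

def gaussian_2d(xy, mean, cov_inv, amplitude):
    d = xy - mean                                # (N, 2) offsets from the center
    m = np.einsum("ni,ij,nj->n", d, cov_inv, d)  # per-point quadratic form
    return amplitude * np.exp(-0.5 * m)          # (N,) Gaussian weights
```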
http://arxiv.org/abs/2407.01864v1
Compressor summary: The study presents a more efficient and accurate YOLOv8-based method for detecting and classifying distracted driving behaviour by integrating BoTNet, GAM attention and EIoU loss.
http://arxiv.org/abs/2407.01863v1
Compressor summary: The study introduces VSP, a benchmark to evaluate visual spatial planning capabilities of vision language models, and reveals their deficiencies in perception, reasoning, and general planning tasks.