arxiv compressed, 2024-07-03

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-03, generated by the compressor, my personal LLM-based project.


MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

http://arxiv.org/abs/2407.02490v1

Compressor summary: MInference is a dynamic sparse attention method for efficient pre-filling of long-context LLMs that reduces latency by up to 10x while maintaining accuracy.


Magic Insert: Style-Aware Drag-and-Drop

http://arxiv.org/abs/2407.02489v1

Compressor summary: Magic Insert is a technique that lets users insert realistic objects from one image into another image with a different style by fine-tuning a text-to-image model, infusing it with the target style, and adapting an object insertion model to diverse artistic styles.


Neurocache: Efficient Vector Retrieval for Long-range Language Modeling

http://arxiv.org/abs/2407.02486v1

Compressor summary: Neurocache extends the context of large language models using a cache to store past states, improving inference speed and accuracy in various tasks.


RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

http://arxiv.org/abs/2407.02485v1

Compressor summary: The paper proposes RankRAG, a framework that fine-tunes large language models for ranking contexts and generating answers in retrieval-augmented generation tasks, achieving better performance than existing models with less data.


MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

http://arxiv.org/abs/2407.02483v1

Compressor summary: The paper presents MMedAgent, an AI agent for the medical domain that selects appropriate specialized models as tools based on user inputs and outperforms existing methods.


Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

http://arxiv.org/abs/2407.02482v1

Compressor summary: The paper introduces Rich-contextual Conditional Diffusion Models (RCDMs), which improve consistency in story visualization by using semantic and temporal context from known clips and reference images.


Understanding Alignment in Multimodal LLMs: A Comprehensive Study

http://arxiv.org/abs/2407.02477v1

Compressor summary: The paper analyzes preference alignment methods for Multimodal Large Language Models (MLLMs), compares offline and online approaches, introduces a new dataset creation method called Bias-Driven Hallucination Sampling (BDHS), and shows its competitive performance.


Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference

http://arxiv.org/abs/2407.02476v1

Compressor summary: The paper proposes a stochastic variational inference method to speed up computations in multi-output Gaussian processes that use latent variables to capture covariance between outputs.


Free Energy in a Circumplex Model of Emotion

http://arxiv.org/abs/2407.02474v1

Compressor summary: The paper proposes a two-dimensional model of emotion based on free energy, valence, and arousal, and demonstrates its application in simulating agents' emotions during a search task.


ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions

http://arxiv.org/abs/2407.02472v1

Compressor summary: ValueScope is a framework that uses language models to analyze and compare social norms across different online communities, revealing their diversity and evolution.


PWM: Policy Learning with Large World Models

http://arxiv.org/abs/2407.02466v1

Compressor summary: Policy Learning with large World Models (PWM) is a new model-based RL algorithm that learns continuous control policies from large multi-task world models, efficiently solving complex problems with many actions and tasks without online planning.


Belief sharing: a blessing or a curse

http://arxiv.org/abs/2407.02465v1

Compressor summary: The paper explores how agents can communicate their beliefs more effectively to avoid echo chambers and self-doubt in collaborative tasks.


SUPER: Seated Upper Body Pose Estimation using mmWave Radars

http://arxiv.org/abs/2407.02455v1

Compressor summary: SUPER is a framework that uses dual-mmWave radars to estimate seated upper body human poses and outperforms existing methods by a large margin.


Ensemble of pre-trained language models and data augmentation for hate speech detection from Arabic tweets

http://arxiv.org/abs/2407.02448v1

Compressor summary: The study proposes a new method for detecting hate speech in Arabic tweets using ensemble learning and semi-supervised learning, which improves accuracy over existing approaches.


PLeaS -- Merging Models with Permutations and Least Squares

http://arxiv.org/abs/2407.02447v1

Compressor summary: The paper proposes a new algorithm, PLeaS, that can merge different machine learning models without sharing data or base architecture, improving performance by 8 to 15 percentage points for certain tasks.


Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling

http://arxiv.org/abs/2407.02446v1

Compressor summary: RLHF models excel at text generation but struggle with world modeling due to their reliance on implicit blueprints for coherence.


Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

http://arxiv.org/abs/2407.02445v1

Compressor summary: AssetGen is a text-to-3D generation system that produces high-quality meshes with realistic textures and PBR materials from few views, yielding results that human evaluators prefer.


Predicting Visual Attention in Graphic Design Documents

http://arxiv.org/abs/2407.02439v1

Compressor summary: The paper presents a model that predicts how people pay attention to graphic design documents, considering both the spatial and temporal aspects of visual fixation using deep learning techniques.


Parameter Matching Attack: Enhancing Practical Applicability of Availability Attacks

http://arxiv.org/abs/2407.02437v1

Compressor summary: The paper proposes Parameter Matching Attack (PMA), a new availability attack for machine learning models that can degrade their performance even when only partially perturbed data is used.


Evaluating the Robustness of Adverse Drug Event Classification Models Using Templates

http://arxiv.org/abs/2407.02432v1

Compressor summary: The text discusses the challenges of evaluating adverse drug event (ADE) detection models on social media text, using hand-crafted templates that probe four model capabilities.


On the Robustness of Graph Reduction Against GNN Backdoor

http://arxiv.org/abs/2407.02431v1

Compressor summary: Graph reduction methods' effectiveness in mitigating backdoor attacks on GNNs varies significantly and some even worsen the situation, raising concerns about security trade-offs in scalable GNN training.


Meta 3D TextureGen: Fast and Consistent Texture Generation for 3D Objects

http://arxiv.org/abs/2407.02430v1

Compressor summary: Meta 3D TextureGen is a fast and high-quality method for generating consistent textures on complex 3D objects using text-to-image networks and 3D semantics.


Reinforcement Learning and Machine ethics: a systematic review

http://arxiv.org/abs/2407.02425v1

Compressor summary: The text is a systematic review of how reinforcement learning can help achieve ethical behavior in autonomous systems.


A Pattern Language for Machine Learning Tasks

http://arxiv.org/abs/2407.02424v1

Compressor summary: The authors propose a graphical language for designing and unifying machine learning tasks, and introduce "manipulators", a novel task that converts classifiers into generative models without custom architectures or adversarial training.


On the Anatomy of Attention

http://arxiv.org/abs/2407.02423v1

Compressor summary: The paper presents a new way to visualize and compare machine learning models using diagrams, and applies it to study different types of attention mechanisms in depth.


Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition

http://arxiv.org/abs/2407.02422v1

Compressor summary: The paper proposes CliqueMining, a novel mining strategy that improves Visual Place Recognition by selecting examples from visually similar image cliques, boosting recall@1 on two benchmarks.


Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs

http://arxiv.org/abs/2407.02411v1

Compressor summary: The paper proposes Video Watermarking, a technique that protects videos from unauthorized annotation by video-based LLMs by embedding imperceptible watermarks into key frames while preserving the viewing experience.


CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

http://arxiv.org/abs/2407.02408v1

Compressor summary: The authors propose CEB, a benchmark for evaluating various types of biases in large language models across different social groups and tasks, using a compositional taxonomy.


Face Reconstruction Transfer Attack as Out-of-Distribution Generalization

http://arxiv.org/abs/2407.02403v1

Compressor summary: The paper proposes a new method to reconstruct face images that can fool face recognition systems on unseen encoders by using Averaged Latent Search and Unsupervised Validation with pseudo target (ALSUV).


Consistency Flow Matching: Defining Straight Flows with Velocity Consistency

http://arxiv.org/abs/2407.02398v1

Compressor summary: Consistency-FM is a novel method that improves flow matching by enforcing self-consistency in the velocity field and using multi-segment training, resulting in faster convergence and better sample quality.


Learning to Refine with Fine-Grained Natural Language Feedback

http://arxiv.org/abs/2407.02397v1

Compressor summary: The authors propose a refinement approach that separates detecting bad generations, generating fine-grained feedback, and refining with that feedback in large language models, improving factual consistency in document-grounded summaries.


Similarity Distance-Based Label Assignment for Tiny Object Detection

http://arxiv.org/abs/2407.02394v1

Compressor summary: The paper proposes a label assignment strategy called SimD for tiny object detection that considers location and shape similarity between bounding boxes and adapts to different datasets and object sizes.


TokenPacker: Efficient Visual Projector for Multimodal LLM

http://arxiv.org/abs/2407.02392v1

Compressor summary: The proposed visual projector uses a coarse-to-fine scheme to generate condensed visual tokens for MLLMs, improving efficiency and reasoning capabilities.


SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

http://arxiv.org/abs/2407.02389v1

Compressor summary: SafaRi is a weakly-supervised bootstrapping architecture for Referring Expression Segmentation that uses less annotations, improves image-text alignment, and performs well in unseen scenarios.


Real HSI-MSI-PAN image dataset for the hyperspectral/multi-spectral/panchromatic image fusion and super-resolution fields

http://arxiv.org/abs/2407.02387v1

Compressor summary: The authors release a real hyperspectral image dataset to improve fusion algorithm development and comparison, as existing simulated datasets have inaccuracies.


OpenSlot: Mixed Open-set Recognition with Object-centric Learning

http://arxiv.org/abs/2407.02386v1

Compressor summary: The paper introduces OpenSlot, a framework for open-set recognition that handles multiple class semantics and reduces noise, achieving state-of-the-art performance on conventional and mixed tasks.


OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

http://arxiv.org/abs/2407.02371v1

Compressor summary: The authors introduce a new high-quality dataset (OpenVid-1M) for text-to-video generation, along with a novel transformer model (MVDiT) that leverages both visual and textual information.


Investigating Event-Based Cameras for Video Frame Interpolation in Sports

http://arxiv.org/abs/2407.02370v1

Compressor summary: This paper explores using event-based cameras to create affordable slow-motion sports videos with deep learning techniques.


Two-Step Q-Learning

http://arxiv.org/abs/2407.02369v1

Compressor summary: Two-step Q-learning is a novel off-policy algorithm that converges almost surely to optimal Q-values, outperforming existing methods on benchmark problems.
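
For intuition, here is a minimal tabular sketch of a Q-update with a two-step bootstrapped target; the paper's exact Two-Step Q-learning rule may differ, and all names here are illustrative:

```python
import numpy as np

def two_step_q_update(Q, transition, alpha=0.1, gamma=0.99):
    """Tabular Q-update with a two-step bootstrapped target (a sketch of
    the generic idea, not necessarily the paper's precise update).
    transition = (s, a, r1, r2, s2): state/action, the next two rewards,
    and the state reached after two steps."""
    s, a, r1, r2, s2 = transition
    target = r1 + gamma * r2 + gamma ** 2 * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```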


GCF: Graph Convolutional Networks for Facial Expression Recognition

http://arxiv.org/abs/2407.02361v1

Compressor summary: GCF is a novel approach that uses Graph Convolutional Networks to improve Facial Expression Recognition by enhancing local CNN features with global features, achieving significant performance improvements over state-of-the-art methods on benchmark datasets.


Talking to Machines: do you read me?

http://arxiv.org/abs/2407.02354v1

Compressor summary: The dissertation covers the author's research on dialogue systems, from modular architectures to end-to-end deep neural networks, and presents contributions to task-oriented dialogues, conversational QA, and large language models for multimodal dialogue.


Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification

http://arxiv.org/abs/2407.02352v1

Compressor summary: Pelican is a framework that detects and mitigates hallucinations in large visual language models by decomposing claims into sub-claims, generating Python code for answering questions, and verifying the correctness of the claim using reasoning abilities.


Generative Large Language Models in Automated Fact-Checking: A Survey

http://arxiv.org/abs/2407.02351v1

Compressor summary: The paper explores how large language models can help fact-checkers identify false information online by using their knowledge and reasoning skills.


Conceptual Codebook Learning for Vision-Language Models

http://arxiv.org/abs/2407.02350v1

Compressor summary: The paper introduces CoCoLe, a method to improve vision-language models' generalization by learning a codebook of visual concepts linked to text encoder inputs for few-shot classification tasks.


Revisiting Cascaded Ensembles for Efficient Inference

http://arxiv.org/abs/2407.02348v1

Compressor summary: The paper proposes a simple adaptive inference scheme called cascade of ensembles (CoE) that uses ensemble agreement to route examples through different models, achieving efficiency gains and reducing costs.
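
A minimal sketch of the routing idea, assuming each model exposes a `predict` method and that an example exits at the first stage whose ensemble members agree (the threshold logic and names are illustrative, not the paper's exact scheme):

```python
import numpy as np

def cascade_of_ensembles(x, stages, agreement=1.0):
    """Route one example through a list of ensembles, cheapest first.
    Exit early when the fraction of members voting for the majority
    label reaches `agreement`; otherwise fall through to the last stage."""
    for ensemble in stages[:-1]:
        preds = [m.predict(x) for m in ensemble]
        labels, counts = np.unique(preds, return_counts=True)
        if counts.max() / len(preds) >= agreement:
            return labels[counts.argmax()]  # members agree: stop here
    preds = [m.predict(x) for m in stages[-1]]  # most capable ensemble
    labels, counts = np.unique(preds, return_counts=True)
    return labels[counts.argmax()]
```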


MORPHEUS: Modeling Role from Personalized Dialogue History by Exploring and Utilizing Latent Space

http://arxiv.org/abs/2407.02345v1

Compressor summary: MORPHEUS is a novel framework for generating personalized dialogues that uses a persona codebook to represent roles in latent space and improve response generation without external role data.


RVISA: Reasoning and Verification for Implicit Sentiment Analysis

http://arxiv.org/abs/2407.02340v1

Compressor summary: The study proposes RVISA, a two-stage reasoning framework that combines generation and reasoning abilities of LLMs to identify implicit sentiment using three-hop reasoning prompting and a verification mechanism.


Open foundation models for Azerbaijani language

http://arxiv.org/abs/2407.02337v1

Compressor summary: This paper presents new resources and benchmarks to advance open-source foundation models for Azerbaijani language understanding and generation.


CALICO: Confident Active Learning with Integrated Calibration

http://arxiv.org/abs/2407.02335v1

Compressor summary: CALICO is an active learning framework that self-calibrates confidence for sample selection in deep neural networks, improving classification performance with limited labeled data.


Why do LLaVA Vision-Language Models Reply to Images in English?

http://arxiv.org/abs/2407.02333v1

Compressor summary: The paper investigates a multilingual bias in vision-language models and suggests that switching the language backbone and intervening on attention layers can reduce it.


MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis

http://arxiv.org/abs/2407.02329v1

Compressor summary: The Multi-Instance Generation (MIG) task involves generating multiple instances in an image with specific attributes, and the proposed methods MIGC, MIGC++, and Consistent-MIG improve control, diversity, and consistency in this task.


Efficient Sparse Attention needs Adaptive Token Release

http://arxiv.org/abs/2407.02328v1

Compressor summary: The paper proposes a method to improve the efficiency and scalability of large language models by adaptively sparsifying attention and rebuilding discarded tokens as needed, achieving significant throughput improvement in natural language generation tasks.


QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

http://arxiv.org/abs/2407.02327v1

Compressor summary: QSync is a system that enables efficient DNN training on hybrid devices by selecting optimal quantized operators based on device resource capacities and synchronizing workers with minimized accuracy degradation.


Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

http://arxiv.org/abs/2407.02322v1

Compressor summary: The text studies SGD dynamics for the least-square problem using SDEs in different settings and provides convergence rates, stationary distribution properties, and numerical simulations.
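
For context, the standard first-order SDE model of SGD with step size $\eta$ looks like the following (a common modeling choice; the paper's exact formulation for the least-squares setting may differ):

```latex
% Continuous-time model of the SGD iterates \theta_{k+1} = \theta_k - \eta g(\theta_k):
% drift from the full-batch gradient, diffusion from minibatch gradient noise
% with covariance \Sigma, driven by Brownian motion W_t.
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\,dW_t
```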


Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

http://arxiv.org/abs/2407.02320v1

Compressor summary: Transliteration can improve low-resource language performance in LLMs across various tasks, but its effectiveness depends on the task and model size.


Soft Language Prompts for Language Transfer

http://arxiv.org/abs/2407.02317v1

Compressor summary: The study explores how to improve cross-lingual NLP applications by using fine-tuning methods along with language-specific and task-specific adapters and soft prompts, finding that combining a soft language prompt with a task adapter often works best.


VFIMamba: Video Frame Interpolation with State Space Models

http://arxiv.org/abs/2407.02315v1

Compressor summary: VFIMamba is a novel frame interpolation method that uses Selective State Space Models (S6) to efficiently model intermediate frames in videos, achieving state-of-the-art performance in high-resolution scenarios.


Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks

http://arxiv.org/abs/2407.02310v1

Compressor summary: Large language models can perform well on semantics-aware process mining tasks after fine-tuning, while struggling without it.


Semantically Guided Representation Learning For Action Anticipation

http://arxiv.org/abs/2407.02309v1

Compressor summary: S-GEAR is a novel framework that learns action representations considering their semantic interconnectedness and improves action anticipation performance on several benchmarks.


Towards Human Understanding of Paraphrase Types in ChatGPT

http://arxiv.org/abs/2407.02302v1

Compressor summary: The study evaluates ChatGPT's ability to generate English paraphrases using different linguistic changes and introduces APTY, a dataset for improving language models.


CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

http://arxiv.org/abs/2407.02301v1

Compressor summary: CFinBench is a benchmark to test Chinese LLMs' financial knowledge on various topics, tasks, and certifications with 99,100 questions.


Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

http://arxiv.org/abs/2407.02286v1

Compressor summary: The paper proposes Selective Jittering and Learnable Point Drop data augmentation techniques to improve LiDAR semantic segmentation performance in adverse weather conditions by addressing geometric perturbation and point drop issues.


Renard: A Modular Pipeline for Extracting Character Networks from Narrative Texts

http://arxiv.org/abs/2407.02284v1

Compressor summary: Renard is a Python library for creating custom NLP pipelines to analyze dynamic and static networks of characters in narrative texts.


A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling

http://arxiv.org/abs/2407.02283v1

Compressor summary: ReSFU is a similarity-based feature upsampling framework for image segmentation that overcomes the alignment, similarity-calculation, and neighbor-selection limitations of existing pipelines and works well across network architectures and segmentation applications.


How to Boost Any Loss Function

http://arxiv.org/abs/2407.02279v1

Compressor summary: Boosting can optimize any loss function without needing first-order information or smoothness conditions, using tools from quantum calculus.


Learning Paradigms and Modelling Methodologies for Digital Twins in Process Industry

http://arxiv.org/abs/2407.02275v1

Compressor summary: This paper reviews various learning paradigms and methodologies used for creating digital twins in the process industry, and identifies challenges and future directions.


Multilingual Trolley Problems for Language Models

http://arxiv.org/abs/2407.02273v1

Compressor summary: The study explores how large language models make moral decisions across various languages and cultures, finding that their alignment with human preferences varies depending on the language.


Aligning Human Motion Generation with Human Perceptions

http://arxiv.org/abs/2407.02272v1

Compressor summary: The authors propose a new method to generate more realistic human motions by introducing a large dataset and a model that captures human preferences, which can be integrated into the generation pipeline.


Improving Explainability of Softmax Classifiers Using a Prototype-Based Joint Embedding Method

http://arxiv.org/abs/2407.02271v1

Compressor summary: The paper proposes a method to improve explainability and uncertainty estimation in softmax classifiers using prototype-based predictions and similarity learning.


DrugCLIP: Contrastive Drug-Disease Interaction For Drug Repurposing

http://arxiv.org/abs/2407.02265v1

Compressor summary: DrugCLIP is a contrastive learning method that automatically discovers new uses for existing drugs by modeling drug-disease interactions in large datasets.


SOAF: Scene Occlusion-aware Neural Acoustic Field

http://arxiv.org/abs/2407.02264v1

Compressor summary: The paper presents SOAF, a novel-view audio-visual synthesis approach for indoor scenes that models sound propagation and scene transmittance and uses Fibonacci Sphere feature extraction to generate binaural audio with directional attention, outperforming previous techniques on real and synthetic datasets.


FreeCG: Free the Design Space of Clebsch-Gordan Transform for machine learning force field

http://arxiv.org/abs/2407.02263v1

Compressor summary: The FreeCG method improves the Clebsch-Gordan Transform layer by using permutation-invariant abstract edges, group CG transform, sparse paths, and attention enhancement to achieve better force and property predictions for molecular datasets.


SiamTST: A Novel Representation Learning Framework for Enhanced Multivariate Time Series Forecasting applied to Telco Networks

http://arxiv.org/abs/2407.02258v1

Compressor summary: SiamTST is a representation learning framework for multivariate time series forecasting, applied to telco networks, that pairs a Siamese transformer architecture with training techniques that improve accuracy.


Parameter-Selective Continual Test-Time Adaptation

http://arxiv.org/abs/2407.02253v1

Compressor summary: The paper introduces a new method, Parameter-Selective Mean Teacher (PSMT), that updates only critical parameters in the Mean Teacher model to avoid error accumulation and catastrophic forgetting when adapting to changing environments.


GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models

http://arxiv.org/abs/2407.02252v1

Compressor summary: The authors propose an end-to-end text rendering framework for poster generation using a triple cross-attention mechanism and a high-resolution dataset, aiming to create precise and contextually rich poster images.


EvolBA: Evolutionary Boundary Attack under Hard-label Black Box condition

http://arxiv.org/abs/2407.02248v1

Compressor summary: The study proposes EvolBA, an adversarial attack method that uses CMA-ES under the hard-label black-box (HL-BB) condition to find adversarial examples with smaller image perturbations.


Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

http://arxiv.org/abs/2407.02243v1

Compressor summary: RIO uses reinforcement learning from human feedback to select exemplars that improve the robustness and quality of zero-shot text-to-speech systems by leveraging reverse inference based on the Bayesian principle.


Sign Language Recognition Based On Facial Expression and Hand Skeleton

http://arxiv.org/abs/2407.02241v1

Compressor summary: The paper proposes a sign language recognition network that uses hand skeleton features and facial expressions to improve accuracy and robustness.


Towards a Holistic Framework for Multimodal Large Language Models in Three-dimensional Brain CT Report Generation

http://arxiv.org/abs/2407.02235v1

Compressor summary: The text describes a study that developed a large language model called BrainGPT to generate accurate and informative 3D brain CT reports by addressing data complexity, model capacity, and evaluation metric issues, and demonstrated its clinical readiness through physician evaluations and a proposed FORTE metric.


Synthetic Multimodal Question Generation

http://arxiv.org/abs/2407.02233v1

Compressor summary: SMMQG is a framework for generating synthetic question-answer pairs from multimodal documents to evaluate MMRAG models, achieving high quality comparable to existing benchmarks.


LaMoD: Latent Motion Diffusion Model For Myocardial Strain Generation

http://arxiv.org/abs/2407.02229v1

Compressor summary: LaMoD is a novel deep learning model that predicts accurate DENSE motions from standard CMR videos for improved myocardial strain analysis in cardiac patients.


MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders

http://arxiv.org/abs/2407.02228v1

Compressor summary: MTMamba is a new architecture for multi-task scene understanding that leverages Mamba to handle long-range dependency and model cross-task interactions, outperforming previous methods on NYUDv2 and PASCAL-Context datasets.


Detecting Driver Fatigue With Eye Blink Behavior

http://arxiv.org/abs/2407.02222v1

Compressor summary: The study evaluates an eye blink feature set to detect driver fatigue using camera-based solutions, which are non-intrusive and adapt to different drivers.


Multi-Modal Video Dialog State Tracking in the Wild

http://arxiv.org/abs/2407.02218v1

Compressor summary: MST-MIXER is a novel video dialog model that tracks multiple modalities and learns local latent graphs to improve performance on real-world scenarios.


Physics-Informed Model and Hybrid Planning for Efficient Dyna-Style Reinforcement Learning

http://arxiv.org/abs/2407.02217v1

Compressor summary: This paper shows how using partial physical knowledge can improve reinforcement learning by enhancing sample efficiency, inference speed, and planning in real-world applications.


PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning

http://arxiv.org/abs/2407.02211v1

Compressor summary: PromptIntern is a novel method that helps large language models learn prompt knowledge internally, reducing inference costs and increasing speed for complex natural language processing tasks.


Generative Monoculture in Large Language Models

http://arxiv.org/abs/2407.02209v1

Compressor summary: Generative monoculture is a phenomenon in large language models where they produce less diverse outputs than expected for certain tasks, which can have positive and negative consequences depending on the use case, and requires better alignment methods to avoid.


How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise on Machine Translation

http://arxiv.org/abs/2407.02208v1

Compressor summary: The paper proposes a self-correction method to improve machine translation performance by using the model's prediction distribution to revise training supervision in the presence of semantic misalignment noise.


Automatic Adaptation Rule Optimization via Large Language Models

http://arxiv.org/abs/2407.02203v1

Compressor summary: The paper proposes using large language models to optimize rule-based self-adaptation systems by leveraging their common sense and reasoning abilities.


Research on Reliable and Safe Occupancy Grid Prediction in Underground Parking Lots

http://arxiv.org/abs/2407.02197v1

Compressor summary: The study proposes a method to improve autonomous vehicle performance in complex indoor environments like underground parking lots using CARLA's simulation platform and an occupancy grid network.


Attack-Aware Noise Calibration for Differential Privacy

http://arxiv.org/abs/2407.02191v1

Compressor summary: The paper proposes methods to directly calibrate noise scale in differential privacy (DP) models based on attack risk, improving utility without sacrificing privacy.


Structure-Aware Consensus Network on Graphs with Few Labeled Nodes

http://arxiv.org/abs/2407.02188v1

Compressor summary: SACN is a novel graph node classification method that leverages structure-aware consensus learning between two augmented views, integrates structural information, and achieves strong performance especially at low label rates.


Virtually Objective Quantification of in vitro Wound Healing Scratch Assays with the Segment Anything Model

http://arxiv.org/abs/2407.02187v1

Compressor summary: The paper proposes a deep learning method using point-prompts for class-agnostic cell segmentation in the in vitro scratch assay, reducing subjectivity and increasing accuracy.


Occlusion-Aware Seamless Segmentation

http://arxiv.org/abs/2407.02182v1

Compressor summary: The authors introduce OASS, a new task that addresses challenges in panoramic image segmentation, present the BlendPASS dataset, and propose UnmaskFormer, a solution that uses Unmasking Attention and Amodal-oriented Mix to achieve state-of-the-art results.


BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream

http://arxiv.org/abs/2407.02174v1

Compressor summary: The paper proposes a method to recover neural radiance fields (NeRF) from a single blurry image and its camera motion, enabling view-consistent sharp images and high-quality rendering.


WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation

http://arxiv.org/abs/2407.02165v1

Compressor summary: WildAvatar is a large dataset for creating 3D human avatars from YouTube videos, addressing the limitations of existing datasets and enabling real-world applications.


UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks

http://arxiv.org/abs/2407.02158v1

Compressor summary: UltraPixel is a novel architecture that efficiently generates high-quality images at multiple resolutions using cascade diffusion models, implicit neural representations, and scale-aware normalization layers.


FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

http://arxiv.org/abs/2407.02157v1

Compressor summary: FineCLIPER is a novel framework that uses Multi-modal Fine-grained CLIP to recognize dynamic facial expressions with improved accuracy and adaptability by extending class labels, using hierarchical cues, and adopting Parameter-Efficient Fine-Tuning.


Equidistribution-based training of Free Knot Splines and ReLU Neural Networks

http://arxiv.org/abs/2407.02153v1

Compressor summary: The paper compares one-dimensional function approximation using shallow neural networks with ReLU activation and traditional methods like Free Knot Splines, and proposes a two-level training method for better performance.


VRBiom: A New Periocular Dataset for Biometric Applications of HMD

http://arxiv.org/abs/2407.02150v1

Compressor summary: The VRBiom dataset contains periocular videos acquired using a VR headset for biometric applications, including iris and periocular recognition, with real and spoofed data.


LlamAr & GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

http://arxiv.org/abs/2407.02147v1

Compressor summary: The authors introduce InstAr-500k, a new Arabic instruction dataset, which improves the performance of language models on various Arabic NLP tasks by fine-tuning existing models.


Counterfactual Data Augmentation with Denoising Diffusion for Graph Anomaly Detection

http://arxiv.org/abs/2407.02143v1

Compressor summary: CAGAD enhances anomaly detection in graphs by creating counterfactual node representations using a graph pointer neural network and a diffusion model.


Efficient Nearest Neighbor based Uncertainty Estimation for Natural Language Processing Tasks

http://arxiv.org/abs/2407.02138v1

Compressor summary: The paper proposes a new uncertainty estimation method for DNNs using k-Nearest Neighbor (kNN) that performs well in calibration, selective prediction, and out-of-distribution detection with low inference cost.
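
A minimal sketch of distance-based kNN uncertainty over a model's embedding space (illustrative names; the paper's exact scoring may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_uncertainty(train_embeddings, test_embeddings, k=10):
    """Score each test input by its mean distance to the k nearest
    training embeddings: inputs far from the training manifold are
    treated as more uncertain (useful for selective prediction and
    out-of-distribution detection)."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_embeddings)
    dists, _ = nn.kneighbors(test_embeddings)
    return dists.mean(axis=1)  # higher = more uncertain
```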


Black Big Boxes: Do Language Models Hide a Theory of Adjective Order?

http://arxiv.org/abs/2407.02136v1

Compressor summary: The paper investigates how language models acquire adjective order preferences (AOPs), the intricate ordering patterns of multiple adjectives in noun phrases that cross syntax, semantics, and pragmatics, introducing a reusable corpus of adjective pairs and AOP measures; LMs track these preferences better than theoretical linguistic factors do but show strong data-frequency effects and limited generalization.


Hybrid Feature Collaborative Reconstruction Network for Few-Shot Fine-Grained Image Classification

http://arxiv.org/abs/2407.02123v1

Compressor summary: HFCR-Net combines channel features and spatial features to improve few-shot fine-grained image classification by enhancing inter-class differences and reducing intra-class differences through a hybrid feature reconstruction process.


Fake News Detection: It's All in the Data!

http://arxiv.org/abs/2407.02122v1

Compressor summary: The text is a survey about fake news detection datasets, emphasizing their importance for model performance, and providing a GitHub repository with publicly accessible datasets.


Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning

http://arxiv.org/abs/2407.02119v1

Compressor summary: RLHF relies on human feedback but is limited by the size of preference data; the authors' approach uses online methods and proxy reward oracles to efficiently label preferences with minimal expert input.


Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

http://arxiv.org/abs/2407.02118v1

Compressor summary: The paper explores constructing large language models for new languages by continually pretraining from existing models, showing faster convergence, resource savings, and different data-parameter allocation compared to training from scratch.


A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data

http://arxiv.org/abs/2407.02112v1

Compressor summary: The paper proposes a data-centric evaluation framework for tabular data models, showing that dataset-specific preprocessing and feature engineering are crucial factors affecting performance, and highlights the importance of test-time adaptation for dynamic data.


HRSAM: Efficiently Segment Anything in High-Resolution Images

http://arxiv.org/abs/2407.02109v1

Compressor summary: HRSAM is a new interactive segmentation model that combines Flash Attention and PSCWin attention to handle high-resolution images and achieve low latency, outperforming previous models.


Automated Knowledge Graph Learning in Industrial Processes

http://arxiv.org/abs/2407.02106v1

Compressor summary: The paper presents a framework for creating knowledge graphs from time series data to help with industrial decision-making and optimization, using Granger causality to find key attributes for predictive models.


Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

http://arxiv.org/abs/2407.02104v1

Compressor summary: The paper proposes joint-dataset learning, cross-consistent regularization (CCCL), and MoT++, a transformer-based skeleton encoder with spatio-temporal attention, to improve text-to-motion retrieval, evaluating on the KIT Motion-Language and HumanML3D datasets.


Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior

http://arxiv.org/abs/2407.02099v1

Compressor summary: The paper explores how assigning different personas to large language models affects their behavior and compares it to a control setting with a generic "helpful assistant" and no persona.


DM3D: Distortion-Minimized Weight Pruning for Lossless 3D Object Detection

http://arxiv.org/abs/2407.02098v1

Compressor summary: The paper proposes a post-training weight pruning scheme for 3D object detection that reduces computational cost and memory footprint while maintaining or enhancing detection precision.


Efficient Bit Labeling in Factorization Machines with Annealing for Traveling Salesman Problem

http://arxiv.org/abs/2407.02091v1

Compressor summary: This paper studies how different binary labeling methods affect the performance of solving large-scale optimization problems using factorization machines with annealing, and proposes a new method called Gray labeling that improves convergence speed and accuracy for the traveling salesman problem.
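
Gray coding itself is standard: consecutive integers differ in exactly one bit, which is the locality such labelings exploit; a minimal sketch (how the paper maps codes onto factorization-machine variables is not shown):

```python
def binary_to_gray(n: int) -> int:
    """Reflected binary (Gray) code: adjacent integers differ by one bit."""
    return n ^ (n >> 1)

def gray_to_binary(g: int) -> int:
    """Inverse transform: fold the higher bits back down."""
    mask = g >> 1
    while mask:
        g ^= mask
        mask >>= 1
    return g

# Consecutive indices map to codes one bit-flip apart:
for i in range(4):
    print(i, format(binary_to_gray(i), "03b"))  # 000, 001, 011, 010
```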


GPTCast: a weather language model for precipitation nowcasting

http://arxiv.org/abs/2407.02089v1

Compressor summary: GPTCast is a generative deep-learning method that uses a large language model to learn spatiotemporal precipitation dynamics from radar images and produce realistic ensemble forecasts with accurate uncertainty estimation.


Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

http://arxiv.org/abs/2407.02077v1

Compressor summary: HTCL is a novel method for improving camera-based semantic scene completion by learning hierarchical temporal context and adaptively refining feature sampling locations, achieving state-of-the-art results on benchmarks.


Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts

http://arxiv.org/abs/2407.02075v1

Compressor summary: Label Anything is a neural network architecture for few-shot semantic segmentation that uses different visual prompts and trains end-to-end across multi-class scenarios, improving adaptability and generalization.


Latent Diffusion Model for Generating Ensembles of Climate Simulations

http://arxiv.org/abs/2407.02070v1

Compressor summary: The authors propose a new deep learning method to quickly and efficiently generate multiple climate scenarios for uncertainty analysis.


LPViT: Low-Power Semi-structured Pruning for Vision Transformers

http://arxiv.org/abs/2407.02068v1

Compressor summary: The paper introduces a hardware-aware, block-structured pruning method for vision transformers that exploits the block-wise structure of linear layers to balance accuracy and power consumption, achieving competitive ImageNet performance with significant speedup and power reduction.


Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models

http://arxiv.org/abs/2407.02067v1

Compressor summary: The study examines large multimodal models' ability to recognize and adapt across different cultural contexts using a new dataset, Dalle Street.


BiasDora: Exploring Hidden Biased Associations in Vision-Language Models

http://arxiv.org/abs/2407.02066v1

Compressor summary: The paper studies how Vision Language Models can have implicit social biases across nine dimensions and creates a dataset to identify and mitigate them.


Are Data Augmentation Methods in Named Entity Recognition Applicable for Uncertainty Estimation?

http://arxiv.org/abs/2407.02062v1

Compressor summary: Data augmentation enhances confidence calibration and uncertainty estimation in Named Entity Recognition tasks using Deep Neural Networks.


Terminating Differentiable Tree Experts

http://arxiv.org/abs/2407.02060v1

Compressor summary: The paper introduces a new neuro-symbolic model that combines transformers and Tensor Product Representations, improves it by reducing parameters and introducing a mixture of experts, and proposes an automatic termination algorithm for controlling the number of steps in the computation.


HC-GLAD: Dual Hyperbolic Contrastive Learning for Unsupervised Graph-Level Anomaly Detection

http://arxiv.org/abs/2407.02057v1

Compressor summary: The paper proposes a new method for detecting anomalies in graphs using hypergraphs, node group connections, and hyperbolic geometry.


Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation

http://arxiv.org/abs/2407.02056v1

Compressor summary: Fine-Grained Self-Consistency (FSC) enhances LLMs' performance in both open-ended and reasoning tasks by integrating segment-level commonalities from candidate samples, and introduces two strategies to further improve output quality.


Abstract Dialectical Frameworks are Boolean Networks (full version)

http://arxiv.org/abs/2407.02055v1

Compressor summary: The paper explores how dialectical frameworks and Boolean regulatory networks, two different models from argumentation and biology respectively, have similarities in appearance and can be related to produce new insights.


CountFormer: Multi-View Crowd Counting Transformer

http://arxiv.org/abs/2407.02047v1

Compressor summary: CountFormer is a 3D multi-view counting framework that handles different camera layouts and achieves superior performance using scene-level volume representation and attention mechanism.


Concise and Precise Context Compression for Tool-Using Language Models

http://arxiv.org/abs/2407.02043v1

Compressor summary: The authors propose two methods to compress tool documentation for language models, reducing input length and decoding time while preserving key information.


Fake News Detection and Manipulation Reasoning via Large Vision-Language Models

http://arxiv.org/abs/2407.02042v1

Compressor summary: The paper introduces manipulation reasoning, a new approach to detecting fake news by reasoning about content manipulations, along with HFFN, a benchmark spanning human-centric and fact-related domains with detailed annotations, and M-DRUM, a multi-modal model that extracts fusion features and outperforms SOTA models and LVLMs such as GPT-4 and LLaVA.


ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

http://arxiv.org/abs/2407.02040v1

Compressor summary: ASD is a text-to-3D method that uses diffusion models to synthesize 3D content faster and more accurately by adjusting the model's timestep, enabling it to handle large numbers of text prompts.


Prompt Stability Scoring for Text Annotation with Large Language Models

http://arxiv.org/abs/2407.02039v1

Compressor summary: The authors propose a Prompt Stability Score (PSS) metric and a Python package to measure and improve the reproducibility of language models for text annotation tasks.
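
A simplified stand-in for the idea: annotate the same documents with several paraphrased prompts and score mean pairwise agreement (the paper's PSS is presumably a more careful, chance-corrected statistic, so treat this as a sketch):

```python
from itertools import combinations

def prompt_stability_score(annotations):
    """`annotations` maps a prompt-variant id to the list of labels the
    model assigned to the same documents under that variant. Returns the
    mean pairwise fraction of matching labels across variants."""
    scores = [
        sum(a == b for a, b in zip(u, v)) / len(u)
        for u, v in combinations(annotations.values(), 2)
    ]
    return sum(scores) / len(scores)

# Example: three paraphrases of one annotation prompt over four documents.
print(prompt_stability_score({
    "v1": [1, 0, 1, 1],
    "v2": [1, 0, 1, 0],
    "v3": [1, 0, 1, 1],
}))  # ~0.833
```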


Camera-LiDAR Cross-modality Gait Recognition

http://arxiv.org/abs/2407.02038v1

Compressor summary: The paper proposes CL-Gait, a cross-modality gait recognition framework between cameras and LiDARs, using a two-stream network and contrastive pre-training with virtual data generation.


TrAME: Trajectory-Anchored Multi-View Editing for Text-Guided 3D Gaussian Splatting Manipulation

http://arxiv.org/abs/2407.02034v1

Compressor summary: The paper proposes a progressive 3D editing strategy with a Trajectory-Anchored Scheme and a dual-branch editing mechanism that ensures multi-view consistency and improves editing quality in text-guided 3D scene editing.


Breaking Bias, Building Bridges: Evaluation and Mitigation of Social Biases in LLMs via Contact Hypothesis

http://arxiv.org/abs/2407.02030v1

Compressor summary: The text explores a debiasing technique for large language models using the Contact Hypothesis, which involves simulating social contact through prompts and instruction-tuning with unbiased responses to reduce prejudices in LLMs.


Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions

http://arxiv.org/abs/2407.02028v1

Compressor summary: The study evaluates in-context learning for open and closed questions of varying difficulty and novelty, revealing a counter-intuitive effect of context relevancy on hard and novel questions.


On the Expressive Power of Sparse Geometric MPNNs

http://arxiv.org/abs/2407.02025v1

Compressor summary: The paper presents a message-passing neural network architecture called EGENNET that can separate non-equivalent geometric graphs based on local features and rotation equivariance, improving upon previous methods for connected graphs and globally rigid graphs.


Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning

http://arxiv.org/abs/2407.02014v1

Compressor summary: The paper proposes a new contrastive learning method that learns multi-grained representations for better generalization on various downstream tasks, outperforming existing methods without using large-scale pretraining data.


DiGRAF: Diffeomorphic Graph-Adaptive Activation Function

http://arxiv.org/abs/2407.02013v1

Compressor summary: DiGRAF is a novel activation function for graph neural networks, based on continuous piecewise-affine transformations, that learns to adapt to different graphs while remaining differentiable, bounded, and efficient, and outperforms other activation functions in experiments.


An End-to-End Speech Summarization Using Large Language Model

http://arxiv.org/abs/2407.02005v1

Compressor summary: The paper presents a speech summarization (SSum) model that fuses audio and text with Q-Former and LLMs, trains in multiple stages with ASR and text summarization (TSum) as auxiliary tasks, and uses curriculum learning to transition from TSum to SSum, achieving competitive performance on the How-2 dataset.


SAVE: Segment Audio-Visual Easy way using Segment Anything Model

http://arxiv.org/abs/2407.02004v1

Compressor summary: The study presents SAVE, a lightweight approach that adapts the pre-trained SAM model for audio-visual segmentation by using image and audio encoder adapters to improve fusion and speed, achieving higher performance on real data than previous methods.


ViG-Bias: Visually Grounded Bias Discovery and Mitigation

http://arxiv.org/abs/2407.01996v1

Compressor summary: Visually Grounded Bias Discovery and Mitigation (ViG-Bias) improves bias detection and reduction in machine learning models by using visual explanations from large vision language models.


Simple Augmentations of Logical Rules for Neuro-Symbolic Knowledge Graph Completion

http://arxiv.org/abs/2407.01994v1

Compressor summary: This work proposes three techniques to enhance rule sets for Neuro-Symbolic Knowledge Graph Completion models, achieving significant improvements in coverage and performance.


Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?

http://arxiv.org/abs/2407.01992v1

Compressor summary: The authors use graph mining to create a contrast set from MCQA datasets and show that large language models do not rely on choices-only shortcuts for high performance in multiple-choice question answering.


Generation of Geodesics with Actor-Critic Reinforcement Learning to Predict Midpoints

http://arxiv.org/abs/2407.01991v1

Compressor summary: The paper proposes a method to find shortest paths on continuous surfaces using recursive midpoint prediction and an actor-critic learning approach, which is theoretically sound and performs better than previous methods in local and global path planning.
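
The recursion is easy to picture; below, `predict_midpoint` stands in for the learned actor (an illustrative sketch, not the paper's implementation):

```python
def geodesic_via_midpoints(x, y, predict_midpoint, depth=5):
    """Approximate a geodesic from x to y by recursively inserting
    predicted midpoints; depth controls resolution (2**depth + 1 points)."""
    if depth == 0:
        return [x, y]
    m = predict_midpoint(x, y)
    left = geodesic_via_midpoints(x, m, predict_midpoint, depth - 1)
    right = geodesic_via_midpoints(m, y, predict_midpoint, depth - 1)
    return left[:-1] + right  # drop the duplicated midpoint
```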


AHMsys: An Automated HVAC Modeling System for BIM Project

http://arxiv.org/abs/2407.01987v1

Compressor summary: AHMsys is a system that automates creating 3D HVAC models from 2D CAD drawings, reducing BIM process time by 20 percent.


SADL: An Effective In-Context Learning Method for Compositional Visual QA

http://arxiv.org/abs/2407.01983v1

Compressor summary: The paper introduces SADL, a visual-linguistic prompting framework for in-context learning in Visual QA that samples, decomposes, and pseudo-labels image-question pairs to bridge the semantic gap between symbols and images.


Unveiling Global Interactive Patterns across Graphs: Towards Interpretable Graph Neural Networks

http://arxiv.org/abs/2407.01979v1

Compressor summary: The paper proposes Global Interactive Pattern (GIP) learning, a novel interpretation scheme for graph classification with GNNs that introduces learnable global patterns via graph clustering and prototype matching, enabling transparent graph-level reasoning and improving interpretability and performance on synthetic and real-world data.


A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

http://arxiv.org/abs/2407.01976v1

Compressor summary: LayTextLLM is a new method for document understanding that improves performance in Key Information Extraction and Visual Question Answering by interleaving text and spatial layout embeddings without long sequence issues.


Pseudo-Labeling by Multi-Policy Viewfinder Network for Image Cropping

http://arxiv.org/abs/2407.01971v1

Compressor summary: MPV-Net uses diverse refining policies to generate trusted pseudo labels for image cropping models using labeled and unlabeled data, achieving state-of-the-art results.


Unleash the Power of Local Representations for Few-Shot Classification

http://arxiv.org/abs/2407.01967v1

Compressor summary: The paper proposes a novel pretraining paradigm with soft labels and a metric with adaptability to improve few-shot classification using local representations.


AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment

http://arxiv.org/abs/2407.01965v1

Compressor summary: The paper proposes a novel framework called AdaCQR that improves conversational search by aligning reformulation models with different types of retrieval systems and using efficient techniques to acquire better labels and input candidates.


Enabling Discriminative Reasoning in Large Language Models for Legal Judgment Prediction

http://arxiv.org/abs/2407.01964v1

Compressor summary: The paper proposes a new reasoning framework (ADAPT) to adapt large language models for better legal judgment prediction by understanding case facts, discriminating charges, and predicting judgments.


Zero-shot Video Restoration and Enhancement Using Pre-Trained Image Diffusion Model

http://arxiv.org/abs/2407.01960v1

Compressor summary: The paper introduces a framework for zero-shot video restoration and enhancement using a pre-trained image diffusion model with a cross-previous-frame attention layer and other techniques to reduce temporal flickering artifacts and improve quality.


FlowTrack: Point-level Flow Network for 3D Single Object Tracking

http://arxiv.org/abs/2407.01959v1

Compressor summary: The paper proposes FlowTrack, a point-level flow method for 3D single object tracking that captures local motion details, handles sparse points with a learnable target feature, and aggregates local motion information into global motion using an Instance Flow Head.


S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

http://arxiv.org/abs/2407.01955v1

Compressor summary: The paper proposes a new method to speed up large language models' decoding process by using sorted speculative decoding with multiple draft models for different target models, reducing costs and improving performance.


Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation

http://arxiv.org/abs/2407.01948v1

Compressor summary: The paper presents a two-stage framework, an LLM-based Fact Extractor plus a BERT-based Fact Encoder fine-tuned with objective functions, that uses factual statements from radiology reports to enhance text encoders for downstream tasks, and introduces CXRFEScore, a new metric for evaluating chest X-ray text generation systems.


Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

http://arxiv.org/abs/2407.01945v1

Compressor summary: The paper presents a simple and reliable method for calibrating a camera-projector pair (CPP) using an unknown cuboid corner, enabling direct 3D reconstruction in indoor scenes with weak textures.


Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness

http://arxiv.org/abs/2407.01942v1

Compressor summary: The paper proposes a taxonomy of uncertainty in vision-language AI systems, creates a dataset with contrastive examples, and introduces a new metric for measuring calibration error.


Efficient-Empathy: Towards Efficient and Effective Selection of Empathy Data

http://arxiv.org/abs/2407.01937v1

Compressor summary: The paper introduces Efficient-Empathy, an algorithm that selects high-quality empathetic data to improve large language models' performance and efficiency in empathetic dialogues.


Probabilistic 3D Correspondence Prediction from Sparse Unsegmented Images

http://arxiv.org/abs/2407.01931v1

Compressor summary: SPI-CorrNet is a deep learning model that predicts 3D correspondences from sparse imaging data, improving the accuracy and robustness of statistical shape modeling in clinical research.


Self-Cooperation Knowledge Distillation for Novel Class Discovery

http://arxiv.org/abs/2407.01930v1

Compressor summary: The paper proposes a Self-Cooperation Knowledge Distillation (SCKD) method to balance reviewing known classes and discovering novel classes in unsupervised learning.


What We Talk About When We Talk About LMs: Implicit Paradigm Shifts and the Ship of Language Models

http://arxiv.org/abs/2407.01929v1

Compressor summary: The paper examines the evolution of the term "Language Models" in scientific discourse and calls for a new perspective on how systems and theories influence each other.


SymPoint Revolutionized: Boosting Panoptic Symbol Spotting with Layer Feature Enhancement

http://arxiv.org/abs/2407.01928v1

Compressor summary: SymPoint-V2 is a new, improved method for recognizing symbols in CAD drawings that uses layer information and faster training to outperform its predecessor SymPoint.


Looking From the Future: Multi-order Iterations Can Enhance Adversarial Attack Transferability

http://arxiv.org/abs/2407.01925v1

Compressor summary: The paper introduces Looking From the Future (LFF), a novel optimization concept that improves the generalization and transferability of adversarial attacks, extends it to a multi-order attack method, $LFF^{\mathcal{N}}$, and demonstrates significantly enhanced attack transferability on the ImageNet1k dataset.


GVDIFF: Grounded Text-to-Video Generation with Diffusion Models

http://arxiv.org/abs/2407.01921v1

Compressor summary: GVDIFF is a text-to-video framework that uses uncertainty-based representations, spatial-temporal grounding, and dynamic gates to generate videos guided by text with different applications.


To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models

http://arxiv.org/abs/2407.01920v1

Compressor summary: The paper proposes MemFlex, a knowledge unlearning method that uses gradient information to precisely target sensitive parameters in large language models, erasing specific knowledge more accurately than existing imprecise methods while retaining essential knowledge.
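
As a rough illustration of the gradient-guided localization the summary describes (a generic sketch, not MemFlex's actual algorithm), one could score parameter tensors by the norm of their gradient on a forget-set loss and unfreeze only the top fraction:

```python
import torch

def select_sensitive_params(model, forget_loss, top_frac=0.05):
    """Hypothetical sketch: rank each parameter tensor by the norm of
    its gradient on the forget-set loss, then leave only the highest-
    scoring fraction trainable for the unlearning updates."""
    model.zero_grad()
    forget_loss.backward()
    scores = {
        name: p.grad.detach().norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    k = max(1, int(top_frac * len(scores)))
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    for name, p in model.named_parameters():
        p.requires_grad_(name in keep)  # freeze everything else
    return keep
```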


Sequential Manipulation Against Rank Aggregation: Theory and Algorithm

http://arxiv.org/abs/2407.01916v1

Compressor summary: The text discusses ranking manipulation using pairwise comparisons and proposes distributionally robust methods to attack and defend against such manipulations.


Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

http://arxiv.org/abs/2407.01911v1

Compressor summary: The authors developed a method to convert single-channel speech recordings into pseudo-stereo data, which increased the training dataset size and improved the performance of spoken dialogue language models.


MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation

http://arxiv.org/abs/2407.01910v1

Compressor summary: The paper proposes MG-Verilog, a high-quality multi-grained Verilog dataset built around explicit quality criteria, together with a balanced fine-tuning scheme that leverages its diversity to enhance LLM-assisted hardware design, addressing the limitations of existing hardware datasets.


Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models

http://arxiv.org/abs/2407.01909v1

Compressor summary: The paper introduces a new Chinese ASR error correction dataset and shows that Pinyin regularization improves the performance of large language models in this task.
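
Pinyin regularization plausibly amounts to exposing the pronunciation of the ASR hypothesis to the LLM so it can reason about homophone confusions. A minimal sketch using the pypinyin library; the prompt wording is invented, not the paper's:

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

def build_correction_prompt(asr_hypothesis: str) -> str:
    """Attach the Pinyin of the ASR hypothesis so the LLM can spot
    homophone errors (e.g. 他/她/它 all read 'ta')."""
    pinyin = " ".join(lazy_pinyin(asr_hypothesis))
    return (
        "Correct the following Chinese ASR transcript. "
        "Words with the same Pinyin are likely confusions.\n"
        f"Transcript: {asr_hypothesis}\n"
        f"Pinyin: {pinyin}\n"
        "Corrected transcript:"
    )

print(build_correction_prompt("今天天汽真好"))  # 汽 should be 气
```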


The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

http://arxiv.org/abs/2407.01907v1

Compressor summary: The paper presents a two-stage method for grounded video question answering: VALOR first answers the question from video information, then TubeDETR localizes the relevant objects in the video with bounding boxes.


Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

http://arxiv.org/abs/2407.01906v1

Compressor summary: The paper explores parameter-efficient fine-tuning for sparse-architecture large language models with a Mixture-of-Experts architecture, proposing Expert-Specialized Fine-Tuning that improves tuning efficiency and performance.


Enhancing Multi-Class Anomaly Detection via Diffusion Refinement with Dual Conditioning

http://arxiv.org/abs/2407.01905v1

Compressor summary: The paper proposes a diffusion model and a transformer combined approach for multi-class anomaly detection in industry, which improves accuracy and avoids common problems such as blurry reconstruction and identical shortcuts.


Text-Aware Diffusion for Policy Learning

http://arxiv.org/abs/2407.01903v1

Compressor summary: TADPoLe is a method that uses pretrained generative models to learn policies from natural language without explicit rewards or demonstrations, achieving natural and diverse behaviors in various robotic domains.


Scope-enhanced Compositional Semantic Parsing for DRT

http://arxiv.org/abs/2407.01899v1

Compressor summary: The AMS parser is a neurosymbolic semantic parser for DRT that excels at handling complex sentences by predicting quantifier scope.


Proposal Report for the 2nd SciCAP Competition 2024

http://arxiv.org/abs/2407.01897v1

Compressor summary: The paper presents a document summarization method that uses auxiliary information to efficiently summarize content related to the objects described in a text, achieving top scores in two tracks of the SciCAP competition.


LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

http://arxiv.org/abs/2407.01896v1

Compressor summary: LogEval is a benchmark suite to evaluate how well large language models perform in various log analysis tasks for AIOps.


Adaptive Modality Balanced Online Knowledge Distillation for Brain-Eye-Computer based Dim Object Detection

http://arxiv.org/abs/2407.01894v1

Compressor summary: The paper presents a brain-eye-computer system for detecting dim targets in aerial images using EEG and computer vision, and proposes an adaptive modality balanced online knowledge distillation (AMBOKD) method to fuse EEG and image features effectively.


GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

http://arxiv.org/abs/2407.01892v1

Compressor summary: The paper introduces GRASP, a large-scale benchmark for evaluating commonsense spatial reasoning in language models, and shows that current advanced LLMs perform poorly on it.


Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents

http://arxiv.org/abs/2407.01887v1

Compressor summary: The paper evaluates LLMs' decision-making abilities in Dueling Bandits, compares their performance to existing algorithms, and proposes an augmented algorithm that combines LLMs' strengths with classic DB guarantees.
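
For readers new to the setting, a dueling bandit differs from a standard bandit in that each round yields only a noisy pairwise preference between two chosen arms, not a numeric reward. The sketch below implements the bare protocol with naive uniform exploration and an empirical Copeland winner, not the paper's LLM-augmented algorithm:

```python
import numpy as np

def dueling_bandit_run(pref_matrix, horizon=1000, rng=None):
    """Minimal dueling-bandit loop: pick a pair of arms each round,
    observe a noisy binary preference, track empirical win rates.
    pref_matrix[i, j] = P(arm i beats arm j)."""
    rng = rng or np.random.default_rng(0)
    k = pref_matrix.shape[0]
    wins, plays = np.zeros((k, k)), np.zeros((k, k))
    for _ in range(horizon):
        i, j = rng.choice(k, size=2, replace=False)
        if rng.random() < pref_matrix[i, j]:
            wins[i, j] += 1
        else:
            wins[j, i] += 1
        plays[i, j] += 1
        plays[j, i] += 1
    rates = wins / np.maximum(plays, 1)
    # Empirical Copeland winner: the arm that beats the most others.
    return int(np.argmax((rates > 0.5).sum(axis=1)))
```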


Core Knowledge Learning Framework for Graph Adaptation and Scalability Learning

http://arxiv.org/abs/2407.01886v1

Compressor summary: The paper proposes a novel algorithm that learns the core subgraph of a graph to improve adaptability, scalability, and generalizability in graph classification tasks.


Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application

http://arxiv.org/abs/2407.01885v1

Compressor summary: The paper surveys knowledge distillation techniques for compressing computationally expensive large language models while preserving accuracy, covering distillation methods, evaluation tasks, and applications.
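
For reference, the canonical objective that most surveyed methods build on is Hinton-style distillation: a KL term between temperature-softened teacher and student distributions, mixed with the usual cross-entropy. A minimal sketch assuming logit-level access to both models:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic KD loss: KL between temperature-softened teacher and
    student distributions, blended with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```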


EIT-1M: One Million EEG-Image-Text Pairs for Human Visual-textual Recognition and More

http://arxiv.org/abs/2407.01884v1

Compressor summary: The authors propose a large multi-modal EEG dataset, EIT-1M, with over 1 million pairs of images, texts, and brain activity recordings to study how the brain processes multiple modalities simultaneously.


Compare without Despair: Reliable Preference Evaluation with Generation Separability

http://arxiv.org/abs/2407.01878v1

Compressor summary: Separability is a meta-evaluation measure that estimates how suitable a test instance is for pairwise preference evaluation by measuring the distinguishability of multiple generations from a model pair.


Spatio-Temporal Graphical Counterfactuals: An Overview

http://arxiv.org/abs/2407.01875v1

Compressor summary: This paper reviews various counterfactual models for AI and proposes a unified graphical approach to handle spatial and temporal interactions.


Automated Text Scoring in the Age of Generative AI for the GPU-poor

http://arxiv.org/abs/2407.01873v1

Compressor summary: The study explores using open-source, small-scale generative language models for automated text scoring and feedback generation on modest hardware.


Referring Atomic Video Action Recognition

http://arxiv.org/abs/2407.01872v1

Compressor summary: The paper introduces RAVAR, a new task to recognize atomic actions of a specific person based on text and video, and presents RefAtomNet, a novel method that uses cross-stream attention to address the challenges of this task.


Image-GS: Content-Adaptive Image Representation via 2D Gaussians

http://arxiv.org/abs/2407.01866v1

Compressor summary: Image-GS is a novel image representation using anisotropic 2D Gaussians that enables content-adaptive rendering with high memory efficiency, fast random access, and a natural level of detail stack for various applications.
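
The core idea, an image expressed as a mixture of 2D Gaussians, is easy to demonstrate in miniature. The toy renderer below splats isotropic Gaussians with additive blending; the paper's representation is anisotropic and fitted to the image, so treat this only as an illustration of the primitive:

```python
import numpy as np

def render_gaussians(h, w, centers, sigmas, colors):
    """Splat N isotropic 2D Gaussians onto an h x w RGB canvas.
    centers: (N, 2) pixel coords; sigmas: (N,); colors: (N, 3)."""
    ys, xs = np.mgrid[0:h, 0:w]
    canvas = np.zeros((h, w, 3))
    for (cx, cy), s, col in zip(centers, sigmas, colors):
        weight = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * s * s))
        canvas += weight[..., None] * col  # additive blending for simplicity
    return np.clip(canvas, 0.0, 1.0)

img = render_gaussians(
    64, 64,
    centers=np.array([[16, 16], [48, 40]]),
    sigmas=np.array([6.0, 10.0]),
    colors=np.array([[1.0, 0.2, 0.2], [0.2, 0.4, 1.0]]),
)
```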


Research on target detection method of distracted driving behavior based on improved YOLOv8

http://arxiv.org/abs/2407.01864v1

Compressor summary: The study presents a more efficient and accurate YOLOv8-based method for detecting and classifying distracted driving behavior by integrating BoTNet, GAM attention, and the EIoU loss.
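
EIoU is a published IoU-loss variant rather than something specific to this paper; below is a sketch for axis-aligned boxes in (x1, y1, x2, y2) form, following the usual formulation of 1 - IoU plus center-distance and width/height penalty terms:

```python
import numpy as np

def eiou_loss(pred, gt, eps=1e-7):
    """EIoU loss for (N, 4) arrays of boxes as (x1, y1, x2, y2)."""
    # Intersection and union for plain IoU.
    ix1, iy1 = np.maximum(pred[:, 0], gt[:, 0]), np.maximum(pred[:, 1], gt[:, 1])
    ix2, iy2 = np.minimum(pred[:, 2], gt[:, 2]), np.minimum(pred[:, 3], gt[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # Smallest enclosing box, for normalizing the penalties.
    cw = np.maximum(pred[:, 2], gt[:, 2]) - np.minimum(pred[:, 0], gt[:, 0])
    ch = np.maximum(pred[:, 3], gt[:, 3]) - np.minimum(pred[:, 1], gt[:, 1])
    # Center-distance, width, and height penalty terms.
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    dist = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) / (cw ** 2 + ch ** 2 + eps)
    dw = ((pred[:, 2] - pred[:, 0]) - (gt[:, 2] - gt[:, 0])) ** 2 / (cw ** 2 + eps)
    dh = ((pred[:, 3] - pred[:, 1]) - (gt[:, 3] - gt[:, 1])) ** 2 / (ch ** 2 + eps)
    return 1 - iou + dist + dw + dh
```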


VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

http://arxiv.org/abs/2407.01863v1

Compressor summary: The study introduces VSP, a benchmark to evaluate visual spatial planning capabilities of vision language models, and reveals their deficiencies in perception, reasoning, and general planning tasks.