arxiv compressed, 2024-06-27

This page contains one-sentence summaries, generated by the compressor (my personal LLM-based project), of cs.AI/ML/CV/CL papers announced on 2024-06-27.


Towards Compositionality in Concept Learning

http://arxiv.org/abs/2406.18534v1

Compressor summary: The paper introduces Compositional Concept Extraction (CCE), a method to find more useful concept representations for foundation models that can be combined to explain the full sample.


On Scaling Up 3D Gaussian Splatting Training

http://arxiv.org/abs/2406.18533v1

Compressor summary: Grendel is a distributed system that partitions and parallelizes 3D Gaussian Splatting (3DGS) computation across multiple GPUs, improving rendering quality for large-scale 3D reconstruction tasks.


Symbolic Learning Enables Self-Evolving Agents

http://arxiv.org/abs/2406.18532v1

Compressor summary: The paper proposes a framework for language agents to optimize themselves using natural language versions of weights, loss, and gradients, enabling them to evolve in the wild.


MatchTime: Towards Automatic Soccer Game Commentary Generation

http://arxiv.org/abs/2406.18530v1

Compressor summary: The paper presents a method to improve soccer game commentary generation by aligning video and text using manual annotations and a multi-modal pipeline, leading to better results on a benchmark dataset.


Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs

http://arxiv.org/abs/2406.18529v1

Compressor summary: The paper proposes a new algorithm to learn efficiently in constrained reinforcement learning with linear function approximation and $q_{\pi}$-realizability, achieving polynomial sample complexity.


PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation

http://arxiv.org/abs/2406.18528v1

Compressor summary: PrExMe is a large-scale comparison of prompts for evaluating natural language generation models, revealing the stability and variability of different strategies in machine translation and summarization tasks.


MultiDiff: Consistent Novel View Synthesis from a Single Image

http://arxiv.org/abs/2406.18524v1

Compressor summary: MultiDiff is a novel method for generating consistent novel views from a single image using depth predictors and video-diffusion models to improve geometric stability and pixel accuracy.


ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

http://arxiv.org/abs/2406.18522v1

Compressor summary: ChronoMagic-Bench is a new text-to-video benchmark that evaluates models' ability to generate coherent and diverse time-lapse videos across various categories of natural phenomena.


CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

http://arxiv.org/abs/2406.18521v1

Compressor summary: CharXiv is a new evaluation suite for multimodal language models that tests their ability to understand diverse and challenging charts, revealing significant performance gaps and weaknesses.


APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

http://arxiv.org/abs/2406.18518v1

Compressor summary: APIGen generates diverse and reliable function-calling datasets for agents, achieving state-of-the-art performance on benchmarks.


Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration

http://arxiv.org/abs/2406.18516v1

Compressor summary: The paper proposes a domain adaptation method for image restoration tasks using diffusion models and noise prediction in the noise-space to align synthetic and real-world data to a common clean distribution.


"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

http://arxiv.org/abs/2406.18512v1

Compressor summary: The research explores how conversational explanations work and tests the effectiveness of large language models like GPT-4 in generating explanation dialogues using a 5-Levels dataset with annotated explanatory acts.


Mental Modeling of Reinforcement Learning Agents by Language Models

http://arxiv.org/abs/2406.18505v1

Compressor summary: The study investigates how well large language models can understand an agent's behavior in the physical world by using their knowledge and reasoning, finding that they are not yet fully capable without further improvements.


Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming

http://arxiv.org/abs/2406.18501v1

Compressor summary: The paper proposes a method to diagnose if large language models' in-context learning is equivalent to gradient-based learning using the inverse frequency effect and finds evidence supporting this hypothesis.


WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

http://arxiv.org/abs/2406.18495v1

Compressor summary: WildGuard is a tool that detects malicious intent and safety risks in Large Language Model (LLM) interactions and evaluates their refusal rates, outperforming existing tools and matching GPT-4's performance.


Robust Surgical Phase Recognition From Annotation Efficient Supervision

http://arxiv.org/abs/2406.18481v1

Compressor summary: The authors propose a robust surgical phase recognition method that can handle missing annotations and introduce SkipTag@K, an annotation approach that reduces costs while maintaining performance.


GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality

http://arxiv.org/abs/2406.18462v1

Compressor summary: The paper introduces GaussianDreamerPro, a framework that binds 3D Gaussians to reasonable geometry to generate high-quality assets with improved details and applicability in various tasks.


Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation

http://arxiv.org/abs/2406.18460v1

Compressor summary: The study explores using role-play zero-shot prompting with multilingual LLMs and an instruction-following model to create efficient and cost-effective open-domain conversational agents, achieving strong results in French.


DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

http://arxiv.org/abs/2406.18459v1

Compressor summary: The paper introduces a training-free progressive method that uses low-resolution images to guide high-resolution image generation with diffusion models, producing high-quality results without extra training or fine-tuning.


Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference

http://arxiv.org/abs/2406.18453v1

Compressor summary: The authors propose a novel method for relative pose estimation without training data that uses a 2.5D shape from an RGB-D reference, an off-the-shelf differentiable renderer, and semantic cues from a pretrained model.


Preference Elicitation for Offline Reinforcement Learning

http://arxiv.org/abs/2406.18450v1

Compressor summary: Sim-OPRL is an offline preference-based RL algorithm that uses a learned environment model to get preference feedback without interacting with the real environment.


Cascading Large Language Models for Salient Event Graph Generation

http://arxiv.org/abs/2406.18449v1

Compressor summary: The paper proposes CALLMSAE, a framework that uses large language models to generate event graphs from documents, focusing on salient events and their relations.


An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors

http://arxiv.org/abs/2406.18445v1

Compressor summary: The paper proposes an autotuning-based framework to optimize hyperparameters in SVMs using mixed kernels for high-dimensional data in HEP and MKH applications, achieving high classification accuracy.


Unveiling the Unknown: Conditional Evidence Decoupling for Unknown Rejection

http://arxiv.org/abs/2406.18443v1

Compressor summary: The paper presents a novel open-set object detection framework that uses conditional evidence decoupling to reject unknown classes and improve performance with scarce training data.


Facial Image Feature Analysis and its Specialization for Fréchet Distance and Neighborhoods

http://arxiv.org/abs/2406.18430v1

Compressor summary: The paper explores how training features on a specific domain affects distance measurement in facial images using self-supervised learning.


Graph Neural Networks for Emulation of Finite-Element Ice Dynamics in Greenland and Antarctic Ice Sheets

http://arxiv.org/abs/2406.18423v1

Compressor summary: The study proposes an equivariant graph convolutional network (EGCN) as a more accurate and efficient emulator for ice sheet dynamics modeling than convolutional neural networks (CNNs).


Repeat and Concatenate: 2D to 3D Image Translation with 3D to 3D Generative Modeling

http://arxiv.org/abs/2406.18422v1

Compressor summary: The paper proposes a simple technique to translate 2D X-ray images into 3D CT-like reconstructions by concatenating multiple 2D views and using neural optimal transport for effective information integration.


Mixture of Experts in a Mixture of RL settings

http://arxiv.org/abs/2406.18420v1

Compressor summary: The text discusses Mixtures of Experts (MoEs) in Deep Reinforcement Learning (DRL), their benefits for handling non-stationarity, and how multi-task training can enhance MoE's performance and understanding.


Towards diffusion models for large-scale sea-ice modelling

http://arxiv.org/abs/2406.18417v1

Compressor summary: Latent diffusion models can generate realistic Arctic sea-ice states using less computational resources than traditional methods, but they may introduce smoothing that affects their accuracy.


BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data

http://arxiv.org/abs/2406.18414v1

Compressor summary: BiTrack is a novel 3D object tracking framework that fuses 2D-3D detection, generates reliable trajectories, and refines them using point-level registration, data association, and re-optimization techniques.


IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons

http://arxiv.org/abs/2406.18406v1

Compressor summary: IRCAN is a framework that strengthens large language models to handle contextual cues and resolve knowledge conflicts using context-aware attribution scores and neuron reweighting.


LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

http://arxiv.org/abs/2406.18403v1

Compressor summary: JUDGE-BENCH is a dataset for evaluating LLMs' ability to replicate human annotations in NLP, showing they still cannot fully replace human judgments.


Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

http://arxiv.org/abs/2406.18400v1

Compressor summary: LLMs can recall facts based on context, acting like an associative memory model with self-attention and a value matrix.


DoubleTake: Geometry Guided Depth Estimation

http://arxiv.org/abs/2406.18387v1

Compressor summary: Our model uses self-generated geometric hints from previous frames to improve depth estimation and 3D scene reconstruction in real time.


KAGNNs: Kolmogorov-Arnold Networks meet Graph Learning

http://arxiv.org/abs/2406.18380v1

Compressor summary: Graph Neural Networks (GNNs) update node representations via message-passing layers, typically using MLPs as the transformation; Kolmogorov-Arnold Networks (KANs), based on the Kolmogorov-Arnold representation theorem, are a promising alternative and perform well in graph regression tasks.


From Majority to Minority: A Diffusion-based Augmentation for Underrepresented Groups in Skin Lesion Analysis

http://arxiv.org/abs/2406.18375v1

Compressor summary: The authors propose a method to improve skin cancer diagnosis for minority groups using synthetic images generated from majority group data.


Dynamic Data Pruning for Automatic Speech Recognition

http://arxiv.org/abs/2406.18373v1

Compressor summary: The paper proposes dynamic data pruning for ASR, which can save training time and maintain performance by selecting relevant data.


Themis: Towards Flexible and Interpretable NLG Evaluation

http://arxiv.org/abs/2406.18365v1

Compressor summary: The paper introduces a large corpus for natural language generation evaluation and a new language model (Themis) that can perform flexible and accurate evaluations without references.


Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model

http://arxiv.org/abs/2406.18364v1

Compressor summary: The paper proposes an improved BERTSum-LSTM model for information extraction from Chinese news, addressing the challenges of complex semantics, large information volume, and language particularity.


Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

http://arxiv.org/abs/2406.18361v1

Compressor summary: The paper introduces SDSeg, a latent diffusion segmentation model that overcomes challenges in medical image segmentation with stable diffusion and single-step reverse process.


XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

http://arxiv.org/abs/2406.18360v1

Compressor summary: This paper introduces a new driving view synthesis dataset and benchmark for testing autonomous vehicle systems in challenging scenarios beyond real-world data.


Kolmogorov-Arnold Graph Neural Networks

http://arxiv.org/abs/2406.18354v1

Compressor summary: GKAN is a new graph neural network model that combines spline functions on edges with accuracy and interpretability, performing well in various graph-based tasks.


Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

http://arxiv.org/abs/2406.18351v1

Compressor summary: The paper proposes a decision framework using reinforcement learning, feedback graph, and intrinsic motivation to improve sample efficiency and inventory control in real-world applications.


On Reducing Activity with Distillation and Regularization for Energy Efficient Spiking Neural Networks

http://arxiv.org/abs/2406.18350v1

Compressor summary: The paper proposes using Knowledge Distillation and Logits Regularization to reduce the spiking activity of spiking neural networks without sacrificing performance, making them more energy-efficient for Edge applications.


AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

http://arxiv.org/abs/2406.18346v1

Compressor summary: The paper analyzes the challenges and limitations of using Reinforcement Learning from Feedback methods to align Artificial Intelligence systems with human values and intentions, and suggests a more critical and reflective approach to their application.


EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition

http://arxiv.org/abs/2406.18345v1

Compressor summary: The emotion transformer (EmT) model leverages prior neurophysiological knowledge to enhance EEG emotion decoding by capturing long-term contextual information and performing well on cross-subject classification and regression tasks.


AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space

http://arxiv.org/abs/2406.18344v1

Compressor summary: The study shows that deep neural networks trained with different objectives share common features related to distinct brain regions, revealing the formation of visual concepts and the processing of visual information in networks.


Grammar Assistance Using Syntactic Structures (GAUSS)

http://arxiv.org/abs/2406.18340v1

Compressor summary: The proposed system offers Spanish grammar coaching with informative feedback, using a linguistic formalism and a fast parsing algorithm that reduces reliance on neural methods and costs.


Efficient and Accurate Explanation Estimation with Distribution Compression

http://arxiv.org/abs/2406.18334v1

Compressor summary: The paper proposes Compress Then Explain (CTE), a new method to improve the efficiency and accuracy of machine learning explanations by using distribution compression through kernel thinning instead of standard i.i.d. sampling.


Continuous Sign Language Recognition Using Intra-inter Gloss Attention

http://arxiv.org/abs/2406.18333v1

Compressor summary: The text introduces a novel module for sign language recognition that leverages local and global contexts using intra-inter gloss attention, improving accuracy without prior knowledge.


Early Classification of Time Series: Taxonomy and Benchmark

http://arxiv.org/abs/2406.18332v1

Compressor summary: The paper proposes a principle-based taxonomy and evaluation dimensions for early classification of time series, and presents experiments comparing nine state-of-the-art methods using an open-source library.


Molecular Diffusion Models with Virtual Receptors

http://arxiv.org/abs/2406.18330v1

Compressor summary: The authors propose a technique for structure-based drug design that uses virtual receptors and protein language embeddings to improve performance and speed.


PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models

http://arxiv.org/abs/2406.18326v1

Compressor summary: PaCoST is a method to detect cheating in large language models by comparing their confidence under original and contaminated benchmarks.


MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

http://arxiv.org/abs/2406.18321v1

Compressor summary: The paper introduces MathOdyssey, a new dataset for testing large language models' abilities in complex math problems, and shows that current LLMs still struggle with the hardest tasks.


ContactNet: Geometric-Based Deep Learning Model for Predicting Protein-Protein Interactions

http://arxiv.org/abs/2406.18314v1

Compressor summary: ContactNet is a novel attention-based Graph Neural Network that can classify accurate protein-protein interaction models from docking algorithms without multiple sequence alignment, achieving higher accuracy than current methods.


AI-native Memory: A Pathway from LLMs Towards AGI

http://arxiv.org/abs/2406.18312v1

Compressor summary: The authors propose that integrating memory into large language models can help achieve artificial general intelligence by simplifying complex inferences and connecting related information.


Online Learning of Multiple Tasks and Their Relationships : Testing on Spam Email Data and EEG Signals Recorded in Construction Fields

http://arxiv.org/abs/2406.18311v1

Compressor summary: The paper proposes an online multi-task learning method that updates task relatedness iteratively using weight vectors and shows its superior performance over a conventional method on three datasets.


Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

http://arxiv.org/abs/2406.18310v1

Compressor summary: This paper introduces STAR-RL, a hierarchical reinforcement learning framework for super-resolution of pathology images, which improves interpretability and diagnosis accuracy.


Automated Immunophenotyping Assessment for Diagnosing Childhood Acute Leukemia using Set-Transformers

http://arxiv.org/abs/2406.18309v1

Compressor summary: The paper introduces FCM-Former, a machine learning tool that automates immunophenotyping for diagnosing childhood acute leukemia, achieving 96.5% accuracy.


S3: A Simple Strong Sample-effective Multimodal Dialog System

http://arxiv.org/abs/2406.18305v1

Compressor summary: The S3 model, a simple yet powerful baseline for multimodal dialog, achieves near state-of-the-art results using a pre-trained language model and modality encoders with a small data mixture.


FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning

http://arxiv.org/abs/2406.18297v1

Compressor summary: This study explores using open-source large language models to identify check-worthy political statements and proposes a data pruning method to improve efficiency and performance.


Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI

http://arxiv.org/abs/2406.18295v1

Compressor summary: Foundation Models are better than problem-specific models for solving multiple Computer Vision problems with high accuracy, especially in Earth Observation applications where they are more efficient with limited data.


Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

http://arxiv.org/abs/2406.18294v1

Compressor summary: The study proposes a Hierarchical Context Pruning strategy to improve repository-level code completion accuracy by modeling code files at the function level and removing irrelevant content, reducing input length for Repo-Code LLMs.


Combining Automated Optimisation of Hyperparameters and Reward Shape

http://arxiv.org/abs/2406.18293v1

Compressor summary: The paper proposes a method to automatically optimize hyperparameters and reward functions together for deep reinforcement learning, improving performance on complex tasks.


RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

http://arxiv.org/abs/2406.18284v1

Compressor summary: RealTalk is a new audio-driven framework for realistic face generation that preserves individual traits and works in real-time.


CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data

http://arxiv.org/abs/2406.18279v1

Compressor summary: The CAS model assesses confidence levels of semantic segmentation algorithms for EO tasks like land cover classification, improving performance and preventing errors using satellite data.


Generalized Deepfake Attribution

http://arxiv.org/abs/2406.18278v1

Compressor summary: GDA-Net is a novel method for attributing deepfakes to their GAN architectures, even if they are generated from different seeds or fine-tuned versions of the models.


Sanskrit Knowledge-based Systems: Annotation and Computational Tools

http://arxiv.org/abs/2406.18276v1

Compressor summary: The authors present a framework for building knowledge graphs from Sanskrit texts, enhancing text analysis and paving the way for further advancements in computational Sanskrit.


"Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions

http://arxiv.org/abs/2406.18266v1

Compressor summary: The authors present the first open-source Romanian Large Language Models (RoLLMs), trained on translated texts and benchmarks, achieving state-of-the-art results across various categories.


Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated

http://arxiv.org/abs/2406.18259v1

Compressor summary: This paper introduces a new ternary text classification scheme to better detect and explain when texts are written by LLMs or humans, addressing the growing concerns about authorship in the era of advanced LLMs.


LLaMIPa: An Incremental Discourse Parser

http://arxiv.org/abs/2406.18256v1

Compressor summary: The paper presents LLaMIPa, a discourse parser that uses an LLM finetuned on SDRT-style annotations and can process discourse data incrementally, improving performance over encoder-only models.


On the Role of Visual Grounding in VQA

http://arxiv.org/abs/2406.18253v1

Compressor summary: The text discusses visual grounding in visual question answering (VQA), formalizes it as visually grounded reasoning, and proposes a method to create out-of-distribution tests that emphasize visual grounding.


Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems

http://arxiv.org/abs/2406.18245v1

Compressor summary: The text discusses a method to improve causal event extraction tasks by using evaluation models trained with reinforcement learning and weak-to-strong supervision, achieving high agreement with human judgement.


ConStyle v2: A Strong Prompter for All-in-One Image Restoration

http://arxiv.org/abs/2406.18242v1

Compressor summary: ConStyle v2 is a plug-and-play prompter that improves U-Net style image restoration models using pre-training, classification, and knowledge distillation techniques.


Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets

http://arxiv.org/abs/2406.18239v1

Compressor summary: This paper tests a new text-to-text annotation tool that uses large foundation models and shows it can perform as well as fine-tuned BERT on German Twitter data without any labeled training data.


PlaMo: Plan and Move in Rich 3D Physical Environments

http://arxiv.org/abs/2406.18237v1

Compressor summary: PlaMo is a system that combines scene-aware path planning and physics-based control to enable humanoids to navigate complex 3D environments.


CoDA: Interactive Segmentation and Morphological Analysis of Dendroid Structures Exemplified on Stony Cold-Water Corals

http://arxiv.org/abs/2406.18236v1

Compressor summary: CoDA is a software tool that allows the visual analysis of complex dendroid coral colonies, helping to understand their growth patterns and shapes.


GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension

http://arxiv.org/abs/2406.18227v1

Compressor summary: The GUIDE dataset provides annotated instructional videos with guidelines to help beginners learn new tasks more effectively and offers three sub-tasks to evaluate model comprehension ability.


Enhancing Data Privacy in Large Language Models through Private Association Editing

http://arxiv.org/abs/2406.18221v1

Compressor summary: PAE is a novel technique that removes personal information from large language models without retraining them, enhancing their data privacy and preventing private data leakage.


Guiding Video Prediction with Explicit Procedural Knowledge

http://arxiv.org/abs/2406.18220v1

Compressor summary: The paper proposes a method to incorporate procedural knowledge into deep learning models for video prediction, leading to better performance than data-driven models alone.


A Closer Look into Mixture-of-Experts in Large Language Models

http://arxiv.org/abs/2406.18219v1

Compressor summary: This paper explores how mixture-of-experts (MoE) works in large language models, revealing its features and suggesting improvements for router design and expert allocation.


Unlocking the Potential of Operations Research for Multi-Graph Matching

http://arxiv.org/abs/2406.18215v1

Compressor summary: The paper proposes new approximation algorithms for incomplete multi-graph matching, which is important for computer vision tasks like image or shape matching, and shows that they significantly outperform existing methods in terms of objective and runtime.


Trimming the Fat: Efficient Compression of 3D Gaussian Splats through Pruning

http://arxiv.org/abs/2406.18214v1

Compressor summary: The study proposes a method called "Trimming the fat" to prune redundant information from 3D models, improving their scalability and performance while reducing memory usage and computation time.


SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding

http://arxiv.org/abs/2406.18200v1

Compressor summary: SeeD is a framework that improves the efficiency and speed of tree-search-based reasoning methods in large language models.


GS-Octree: Octree-based 3D Gaussian Splatting for Robust Object-level 3D Reconstruction Under Strong Lighting

http://arxiv.org/abs/2406.18199v1

Compressor summary: The text introduces a novel technique that combines octree-based implicit surface representations and Gaussian splatting to accurately reconstruct the geometry of target objects from multi-view images, especially under strong lighting conditions.


VDG: Vision-Only Dynamic Gaussian for Driving Simulation

http://arxiv.org/abs/2406.18198v1

Compressor summary: The paper proposes a new method (VDG) for dynamic scene reconstruction that integrates self-supervised vision optimization, improves initialization and decomposition, works with RGB images, and outperforms existing methods.


Human-free Prompted Based Anomaly Detection: prompt optimization with Meta-guiding prompt scheme

http://arxiv.org/abs/2406.18197v1

Compressor summary: The paper proposes a data-driven approach to learn prompts for prompt-based anomaly detection using synthesized samples and a modified attention mechanism for pixel-wise anomaly segmentation.


MammothModa: Multi-Modal Large Language Model

http://arxiv.org/abs/2406.18193v1

Compressor summary: MammothModa is a multi-modal language model with improved visual capabilities, extended context window, and high-quality bilingual datasets, achieving state-of-the-art performance in various benchmarks.


Methodology of Adapting Large English Language Models for Specific Cultural Contexts

http://arxiv.org/abs/2406.18192v1

Compressor summary: The paper proposes a method to adapt large language models for specific cultural contexts by tuning them with cultural knowledge and safety values data, improving their performance and adaptability.


Selective Prompting Tuning for Personalized Conversations with LLMs

http://arxiv.org/abs/2406.18187v1

Compressor summary: The paper proposes Selective Prompt Tuning (SPT), a method to improve conversational AI personalization by adaptively selecting and enhancing prompts using context-prompt contrastive learning and prompt fusion learning.


DeepExtremeCubes: Integrating Earth system spatio-temporal data for impact assessment of climate extremes

http://arxiv.org/abs/2406.18179v1

Compressor summary: The DeepExtremeCubes database is a new tool that uses satellite images and other data to help scientists study how extreme heatwaves and droughts affect ecosystems around the world.


Games of Knightian Uncertainty

http://arxiv.org/abs/2406.18178v1

Compressor summary: The paper argues that to advance AI through game research, we need to tackle Knightian uncertainty, or the ability to adapt to sudden rule changes in games without prior information or models.


VIPriors 4: Visual Inductive Priors for Data-Efficient Deep Learning Challenges

http://arxiv.org/abs/2406.18176v1

Compressor summary: The "VIPriors" workshop's fourth edition featured data-impaired challenges for computer vision tasks with limited data, where participants used inductive biases, data augmentation, and model ensembles to improve data efficiency.


UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

http://arxiv.org/abs/2406.18173v1

Compressor summary: UIO-LLMs are memory-enhanced transformers that use incremental optimization techniques to handle long texts more effectively and efficiently.


Start from Zero: Triple Set Prediction for Automatic Knowledge Graph Completion

http://arxiv.org/abs/2406.18166v1

Compressor summary: The paper introduces Triple Set Prediction (TSP), a novel graph-level KG completion task, and proposes GPHT, a subgraph-based method for fast prediction of missing triples, along with two baselines and evaluation metrics.


NeBuLa: A discourse aware Minecraft Builder

http://arxiv.org/abs/2406.18164v1

Compressor summary: NeBuLa is a model that uses prior context and nonlinguistic cues to improve language-to-action predictions in collaborative tasks.


Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

http://arxiv.org/abs/2406.18159v1

Compressor summary: The paper proposes a diffusion model-based approach to generate 3D scenes from human motion sequences that adhere to spatial constraints, avoid collisions, and respect layout constraints.


SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery

http://arxiv.org/abs/2406.18151v1

Compressor summary: The paper introduces SynRS3D, a large synthetic remote sensing 3D dataset, and RS3DAda, a multi-task unsupervised domain adaptation method that enables global monocular 3D semantic understanding from synthetic data.


A Refer-and-Ground Multimodal Large Language Model for Biomedicine

http://arxiv.org/abs/2406.18146v1

Compressor summary: The authors introduce a new dataset (Med-GRIT-270k) and a multimodal language model (BiRD) for biomedical image refer and ground conversations, which can help develop intelligent biomedical assistants.


Artificial Immune System of Secure Face Recognition Against Adversarial Attacks

http://arxiv.org/abs/2406.18144v1

Compressor summary: The paper proposes an artificial immune system-inspired defense that makes face recognition systems more robust against adversarial attacks.


Exclusive Style Removal for Cross Domain Novel Class Discovery

http://arxiv.org/abs/2406.18140v1

Compressor summary: The paper studies Novel Class Discovery (NCD), which clusters unseen novel classes using labeled data from the same domain, extends it to the cross-domain setting where style information must be removed, introduces an exclusive style removal module that improves performance across distributions, and builds a fair benchmark for future NCD research.


LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

http://arxiv.org/abs/2406.18139v1

Compressor summary: LOOK-M is a novel approach that efficiently compresses the multimodal KV cache in long-context MLLMs, enabling faster decoding and maintained or improved task performance.


Automatic Speech Recognition for Hindi

http://arxiv.org/abs/2406.18135v1

Compressor summary: The text describes a Hindi ASR web application and interface that handles large volumes of audio files, transcriptions, and voice activity detection using neural networks and machine learning techniques.


Assessing "Implicit" Retrieval Robustness of Large Language Models

http://arxiv.org/abs/2406.18134v1

Compressor summary: The paper evaluates how well large language models handle relevant or irrelevant retrieved context without explicit relevance judgments, finding that fine-tuning on mixed context improves robustness to retrieval inaccuracies.


ConvoCache: Smart Re-Use of Chatbot Responses

http://arxiv.org/abs/2406.18133v1

Compressor summary: ConvoCache is a caching system for spoken chatbots that reuses responses to similar prompts, improving speed and cost-efficiency.
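The core mechanism, reusing a stored response when a new prompt is close enough to one seen before, can be sketched as an embedding-similarity cache. This is a minimal illustration, not the paper's system: the toy bag-of-words embedding, the `ConvoCacheSketch` class, and the 0.8 threshold are all assumptions; a real deployment would use a learned sentence encoder.

```python
import math
from collections import defaultdict

def embed(text):
    # Toy bag-of-words embedding; a real system would use a sentence encoder.
    vec = defaultdict(float)
    for tok in text.lower().split():
        vec[tok] += 1.0
    return dict(vec)

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ConvoCacheSketch:
    """Reuse a stored response when a new prompt is similar enough to a past one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

    def get(self, prompt):
        # Return the cached response of the most similar past prompt, or None
        # (i.e. fall through to the chatbot) if nothing is close enough.
        e = embed(prompt)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:
            sim = cosine(e, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None
```

On a cache hit the chatbot call (and its latency and cost) is skipped entirely; on a miss the freshly generated response would be `put` back into the cache.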


Sequential Disentanglement by Extracting Static Information From A Single Sequence Element

http://arxiv.org/abs/2406.18131v1

Compressor summary: The text proposes a novel architecture that reduces information leakage in unsupervised sequential disentanglement by using a subtraction inductive bias with only one sample, improving performance on various data-modality benchmarks.


CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection

http://arxiv.org/abs/2406.18129v1

Compressor summary: The paper proposes a novel Complex-to-Simple framework to improve sim-to-real domain adaptation in 3D object detection using fixed-size anchor heads, RoI augmentation, corner-format representation of aleatoric uncertainty, and noise-aware mean teacher method.


ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

http://arxiv.org/abs/2406.18125v1

Compressor summary: The authors present a comprehensive approach to resume classification using large-scale datasets and advanced language models, achieving significant improvements over traditional machine learning methods.


Poisoned LangChain: Jailbreak LLMs by LangChain

http://arxiv.org/abs/2406.18122v1

Compressor summary: The paper introduces Poisoned-LangChain (PLC), an indirect jailbreak attack that bypasses the robust filtering of large models by poisoning the external knowledge base used in Retrieval-Augmented Generation, successfully attacking six large language models under three scenarios with high success rates.


ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs

http://arxiv.org/abs/2406.18120v1

Compressor summary: The paper presents methods for machine translation and speech recognition of code-switched Egyptian Arabic-English, using large language models and achieving significant improvements.


Robust personnel rostering: how accurate should absenteeism predictions be?

http://arxiv.org/abs/2406.18119v1

Compressor summary: The paper proposes a method to evaluate and improve roster robustness for employee scheduling using machine learning predictions of absenteeism and optimization.


BADGE: BADminton report Generation and Evaluation with LLM

http://arxiv.org/abs/2406.18116v1

Compressor summary: The authors propose BADGE, a framework that uses a large language model (LLM) to generate and evaluate badminton reports automatically, potentially enhancing sports promotion.


The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval

http://arxiv.org/abs/2406.18113v1

Compressor summary: The paper presents Mr. BLIP, a simple and effective model that uses image-text pretrained multimodal language models to achieve state-of-the-art results in video moment retrieval tasks without additional input signals or complex architectures.


Token-Weighted RNN-T for Learning from Flawed Data

http://arxiv.org/abs/2406.18108v1

Compressor summary: The paper proposes a method to improve ASR models by adjusting the importance of tokens based on their likelihood of being errors, helping to reduce accuracy loss from transcription errors in training data.
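The idea of down-weighting probably erroneous reference tokens in the training loss can be sketched as a weighted per-token negative log-likelihood. This is a minimal sketch under assumptions: the function name and inputs are hypothetical, the weights are taken as given (estimating them is the paper's topic), and the full RNN-T transducer loss is replaced by a simple token-level sum.

```python
def token_weighted_loss(log_probs, weights):
    # log_probs: model log-probability assigned to each reference token.
    # weights:   per-token reliability in [0, 1]; tokens suspected to be
    #            transcription errors get low weight, shrinking their
    #            contribution to the loss (and hence to the gradient).
    assert len(log_probs) == len(weights)
    return -sum(w * lp for lp, w in zip(log_probs, weights)) / len(log_probs)
```

Down-weighting a token the model finds very unlikely (log-prob -5.0 below) keeps a suspect transcription from dominating training.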


Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections

http://arxiv.org/abs/2406.18094v1

Compressor summary: The paper describes an approach to automatically generate "Brief Hospital Course" and "Discharge Instructions" sections for EHRs using LoRA fine-tuning on ClinicalT5-large, achieving a ROUGE-1 score of 0.394.


LLM-Driven Multimodal Opinion Expression Identification

http://arxiv.org/abs/2406.18088v1

Compressor summary: The study introduces a new multimodal opinion expression identification task, using text and speech inputs, and proposes an LLM-driven method that significantly improves performance.


Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints

http://arxiv.org/abs/2406.18085v1

Compressor summary: The paper proposes global and local knowledge constraints for multilingual knowledge graph completion, improving Hits@1 and Hits@10 performance by over 12% and 16%, respectively.


Octo-planner: On-device Language Model for Planner-Action Agents

http://arxiv.org/abs/2406.18082v1

Compressor summary: The paper introduces an efficient on-device Planner-Action framework for AI agents on edge devices that separates planning (Phi-3 Mini) from action execution (the Octopus model), using model fine-tuning instead of in-context learning to reduce costs and response times, and multi-LoRA training to handle multi-domain queries.


MFDNet: Multi-Frequency Deflare Network for Efficient Nighttime Flare Removal

http://arxiv.org/abs/2406.18079v1

Compressor summary: MFDNet is a lightweight network that uses Laplacian Pyramid to decompose images into low and high-frequency bands for flare removal while preserving image quality.
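The frequency decomposition MFDNet builds on can be illustrated with a single band split: a low-frequency band from a blur plus the high-frequency residual, whose sum reconstructs the input, which is the invariant each Laplacian pyramid level maintains. A 1-D toy sketch, not the authors' network; `split_bands` and the box blur are assumptions (a real pyramid uses Gaussian filtering and downsampling).

```python
def split_bands(signal, radius=2):
    # Low-frequency band: box blur with the given radius.
    n = len(signal)
    low = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        low.append(sum(signal[lo:hi]) / (hi - lo))
    # High-frequency band: the residual, so low + high reconstructs the input.
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

A deflare network can then process the bands separately (flares live mostly in the low-frequency band) and sum them back losslessly.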


Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction

http://arxiv.org/abs/2406.18078v1

Compressor summary: The paper proposes a self-training framework with a pseudo-label scorer for aspect-based sentiment analysis, which improves performance by filtering out mismatches and using a human-annotated comparison dataset.


Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

http://arxiv.org/abs/2406.18074v1

Compressor summary: DSPNet is a novel few-shot semantic segmentation method for medical imaging that constructs high-fidelity prototypes for object foreground and background using multi-modal clustering and channel-aware regulation.


EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

http://arxiv.org/abs/2406.18070v1

Compressor summary: The authors present EgoVideo, a novel egocentric foundation model for various tasks in the Ego4D and EPIC-Kitchens challenges, showcasing its versatility and effectiveness.


Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

http://arxiv.org/abs/2406.18068v1

Compressor summary: The paper presents a method to generate realistic co-speech facial expressions and upper-body gestures for digital characters using RGB video data and multimodal learning.


Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification

http://arxiv.org/abs/2406.18067v1

Compressor summary: MEJEM is a new dialect identification model that uses energy margin loss to detect out-of-distribution data better than existing methods.
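The energy scores such models threshold can be sketched with the standard free-energy formulation E(x) = -T * logsumexp(logits / T): inputs with a confident, peaked logit vector get lower (more negative) energy than ambiguous ones. This is a generic illustration of energy-based OOD scoring, not MEJEM itself; the threshold value is an assumption.

```python
import math

def energy_score(logits, temperature=1.0):
    # E(x) = -T * logsumexp(logits / T), computed in a numerically stable way
    # by factoring out the maximum before exponentiating.
    t = temperature
    m = max(l / t for l in logits)
    return -t * (m + math.log(sum(math.exp(l / t - m) for l in logits)))

def is_ood(logits, threshold):
    # Higher (less negative) energy means the input looks out-of-distribution.
    return energy_score(logits) > threshold
```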


Learning Optimal Filters Using Variational Inference

http://arxiv.org/abs/2406.18066v1

Compressor summary: The paper presents a method to learn how to filter nonlinear dynamical systems using variational inference, which can improve existing filtering techniques like the ensemble Kalman filter.


Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

http://arxiv.org/abs/2406.18064v1

Compressor summary: vRAG-Eval is a system that grades answer quality in RAG applications using binary scores, and shows promising alignment with human experts for GPT-4.


AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

http://arxiv.org/abs/2406.18060v1

Compressor summary: The paper introduces AdaZeta, a framework to improve the performance, convergence, and memory efficiency of fine-tuning large language models using zeroth-order methods.


Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies

http://arxiv.org/abs/2406.18053v1

Compressor summary: BrHPO is a new HRL algorithm that allows both levels to communicate and correct errors, improving subgoal reachability and performance on long-horizon tasks.


ViT-1.58b: Mobile Vision Transformers in the 1-bit Era

http://arxiv.org/abs/2406.18051v1

Compressor summary: ViT-1.58b is a novel, low-precision quantized ViT model that achieves comparable performance to full-precision models while reducing memory and computational costs for resource-constrained environments.
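The "1.58-bit" regime means ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A common recipe, popularized by BitNet b1.58, is absmean quantization, sketched below; ViT-1.58b's exact scheme may differ, and the function names here are assumptions.

```python
def ternary_quantize(weights, eps=1e-8):
    # Per-tensor scale: mean absolute weight (the "absmean" in absmean quantization).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Each weight becomes -1, 0, or +1 after scaling, rounding, and clipping.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction used at inference time.
    return [v * scale for v in q]
```

Because the quantized weights are ternary, matrix multiplies reduce to additions and subtractions, which is where the memory and compute savings come from.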


A Multi-Stage Goal-Driven Network for Pedestrian Trajectory Prediction

http://arxiv.org/abs/2406.18050v1

Compressor summary: MGNet predicts pedestrian trajectories by forecasting intermediate goals using a CVAE, attention module, and goal evaluator, improving safety and efficiency in applications like autonomous vehicles.


ScanFormer: Referring Expression Comprehension by Iteratively Scanning

http://arxiv.org/abs/2406.18048v1

Compressor summary: The paper introduces ScanFormer, an efficient image-text alignment model that iteratively extracts linguistically-relevant visual patches from images using informativeness prediction and patch selection.


PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

http://arxiv.org/abs/2406.18045v1

Compressor summary: PharmGPT is a specialized LLM suite that outperforms general models on bio-pharmaceutical and chemical tasks, opening new possibilities for NLP in these domains.


Multimodal foundation world models for generalist embodied agents

http://arxiv.org/abs/2406.18043v1

Compressor summary: GenRL is a multimodal framework that connects vision-language models with generative world models for reinforcement learning, enabling generalist embodied agents to learn diverse tasks in different domains.


MT2ST: Adaptive Multi-Task to Single-Task Learning

http://arxiv.org/abs/2406.18038v1

Compressor summary: The MT2ST framework combines the benefits of multi-task and single-task learning in word embedding training, reducing overfitting and training time significantly.


Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation

http://arxiv.org/abs/2406.18037v1

Compressor summary: This paper proposes SMG-Learning, a novel training paradigm for deep networks to improve sequential learning from different medical image sites by using Parallel Gradient Alignment and Site-Modulated Diffusion techniques.


LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them

http://arxiv.org/abs/2406.18034v1

Compressor summary: The paper introduces DoctorFLAN, a Chinese medical dataset for tuning LLMs to be effective medical assistants who collaborate with doctors, based on a survey of their needs.


Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

http://arxiv.org/abs/2406.18035v1

Compressor summary: The paper introduces a new concept called "local linear recovery" to analyze whether overparameterized deep neural network models can reliably recover target functions.


Boosting Soft Q-Learning by Bounding

http://arxiv.org/abs/2406.18033v1

Compressor summary: The paper derives bounds on the optimal value function from soft Q-learning value estimates and uses them to improve training performance.
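One way bounds can be used in soft Q-learning is to clip value targets into a provable range: for the log-sum-exp soft value, max(Q) <= V <= max(Q) + T * log|A| always holds, so a noisy value estimate can be clipped into that interval before it enters a Bellman backup. This sketches the general idea only; the paper's bounds are derived differently, and the function names are assumptions.

```python
import math

def soft_value(q_values, temperature):
    # Log-sum-exp soft state value used in soft Q-learning,
    # computed stably by factoring out the maximum.
    t = temperature
    m = max(q_values)
    return t * (m + math.log(sum(math.exp((q - m) / t) for q in q_values)))

def clip_to_soft_bounds(v_est, q_values, temperature):
    # For any Q vector, max(Q) <= soft value <= max(Q) + T * log|A|,
    # so a noisy value estimate can be clipped into that interval
    # before it is used as a backup target.
    lo = max(q_values)
    hi = lo + temperature * math.log(len(q_values))
    return min(max(v_est, lo), hi)
```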


Real-time Structure Flow

http://arxiv.org/abs/2406.18031v1

Compressor summary: The article presents a structure flow field method that uses PDEs and predictor-update algorithms to generate real-time high-speed motion information for robotic devices and autonomous vehicles using images and depth measurements.


Automated Clinical Data Extraction with Knowledge Conditioned LLMs

http://arxiv.org/abs/2406.18027v1

Compressor summary: The proposed framework uses in-context learning to align internal and external knowledge, improving the accuracy of lung lesion information extraction from text using a two-stage approach.


AutoOPE: Automated Off-Policy Estimator Selection

http://arxiv.org/abs/2406.18022v1

Compressor summary: The paper proposes an automated machine learning method for selecting the best Off-Policy Evaluation (OPE) estimator based on synthetic tasks.


MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

http://arxiv.org/abs/2406.18020v1

Compressor summary: MolFusion is a method for combining molecular representations to predict drug properties better by aligning molecular-level and atomic-level information.


View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis

http://arxiv.org/abs/2406.18012v1

Compressor summary: The paper introduces OmniAD, a novel network for unsupervised anomaly detection between imperfectly aligned images of infrastructure scenes using refined reverse distillation and new data augmentation strategies.


Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

http://arxiv.org/abs/2406.18011v1

Compressor summary: Expressive Keypoints with Skeleton Transformation and Instance Pooling improve skeleton-based action recognition by capturing subtle human actions using fine-grained joint details.


Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher

http://arxiv.org/abs/2406.18002v1

Compressor summary: The paper proposes an algorithm to use limited supervision from LLMs to improve sLLM's generative quality by adaptively trusting or ignoring LLM predictions based on sLLM confidence.
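The confidence-gated idea can be sketched per token: keep the small model's own prediction when it is confident, and fall back to the teacher otherwise. A toy illustration only; `adaptive_decode`, the dict-based distributions, and the 0.7 threshold are assumptions, and the paper's gating rule is more refined than a single cutoff.

```python
def adaptive_decode(student_dists, teacher_tokens, confidence=0.7):
    # student_dists:  per-position dicts mapping token -> probability
    #                 from the small language model.
    # teacher_tokens: the large teacher's token at each position
    #                 (the limited supervision).
    out = []
    for dist, teacher_tok in zip(student_dists, teacher_tokens):
        student_tok, prob = max(dist.items(), key=lambda kv: kv[1])
        # Trust the student when it is confident; otherwise defer to the teacher.
        out.append(student_tok if prob >= confidence else teacher_tok)
    return out
```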


Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model

http://arxiv.org/abs/2406.17998v1

Compressor summary: This paper proposes a generative model that simulates changes over time using a probabilistic graphical model, enabling cheap and automatic data generation for training deep vision models in Earth's surface dynamics.


Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models

http://arxiv.org/abs/2406.17992v1

Compressor summary: DELD is a method that detects evolving disinformation generated by large language models using pre-trained language models and soft prompts.


Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models

http://arxiv.org/abs/2406.17990v1

Compressor summary: Explicit diversity conditions improve question answering system accuracy by generating more diverse and relevant synthetic data, especially in low-resource domains.


Learning Neural Networks with Sparse Activations

http://arxiv.org/abs/2406.17989v1

Compressor summary: This paper studies the learnability and benefits of MLP layers with sparse activations, which are common in neural network architectures but harder to optimize.


DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

http://arxiv.org/abs/2406.17988v1

Compressor summary: DICE is an end-to-end method that uses a Transformer-based architecture to recover 3D hand-face interactions from a single image, achieving state-of-the-art performance and interactive speed.


Multi-step Knowledge Retrieval and Inference over Unstructured Data

http://arxiv.org/abs/2406.17987v1

Compressor summary: Cora is a neuro-symbolic AI platform developed by Elemental Cognition that enhances natural language understanding and reasoning for high-stakes domains, outperforming pure LLM or RAG approaches.