arxiv compressed, 2024-01-30

This page contains one-sentence summaries of the cs.AI/ML/CV/CL papers announced on 2024-01-30, generated by the compressor, my personal LLM-based summarization project.


Computer Vision for Primate Behavior Analysis in the Wild

http://arxiv.org/abs/2401.16424v1

Compressor summary: The paper discusses current computer vision methods for studying animal behavior in videos and suggests future directions to overcome practical challenges.


Synchformer: Efficient Synchronization from Sparse Cues

http://arxiv.org/abs/2401.16423v1

Compressor summary:
Key points:
- Objective: synchronize audio and video in real-world videos with sparse cues
- Contributions: novel model, multi-modal pre-training, state-of-the-art performance, AudioSet dataset, interpretability, audio-visual synchronizability
Summary: The paper presents a new audio-visual synchronization model that uses multi-modal pre-training and performs well on real-world videos with sparse cues. It also explores new aspects like interpretability and synchronizability.


Strategic Usage in a Multi-Learner Setting

http://arxiv.org/abs/2401.16422v1

Compressor summary: The paper studies how strategic users choosing among multiple online services affect convergence or oscillation in service optimization, and shows that memory-based retraining can ensure convergent behavior for some loss functions.


Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation

http://arxiv.org/abs/2401.16421v1

Compressor summary: BiPE is a new positional encoding method that combines intra- and inter-segment encodings to improve semantic information and extrapolation capabilities for language sequences.
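
Going only by this one-sentence description, the bilevel idea can be pictured as giving each token both an intra-segment position and a segment index, encoding the two separately, and combining them. The sketch below is a toy illustration of that split rather than the paper's actual formulation; the separator-based segmentation, sinusoidal encoding, and summation are all assumptions.

```python
import numpy as np

def sinusoidal(positions, dim):
    """Standard sinusoidal encoding for a 1-D array of positions."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000 ** (2 * i / dim))
    angles = np.outer(positions, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def bilevel_positional_encoding(token_ids, sep_id, dim=64):
    """Toy bilevel PE: intra-segment position plus segment index, summed.

    Segments are delimited by sep_id (an illustrative choice); the real
    method's segmentation and combination rule may differ.
    """
    intra, segment = [], []
    pos_in_seg, seg_idx = 0, 0
    for tok in token_ids:
        intra.append(pos_in_seg)
        segment.append(seg_idx)
        if tok == sep_id:                  # start a new segment after a separator
            pos_in_seg, seg_idx = 0, seg_idx + 1
        else:
            pos_in_seg += 1
    return sinusoidal(np.array(intra), dim) + sinusoidal(np.array(segment), dim)

# Example: two "sentences" separated by token 0
pe = bilevel_positional_encoding([5, 7, 0, 9, 3, 2], sep_id=0)
print(pe.shape)  # (6, 64)
```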


InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

http://arxiv.org/abs/2401.16420v1

Compressor summary: InternLM-XComposer2 is a powerful vision-language model that creates and understands text-image compositions, using Partial LoRA to balance precision and creativity.


Semi-parametric Expert Bayesian Network Learning with Gaussian Processes and Horseshoe Priors

http://arxiv.org/abs/2401.16419v1

Compressor summary: The paper presents a new semi-parametric expert Bayesian network (SEBN) model that combines linear constraints, Gaussian Processes, and a Horseshoe prior to learn semi-parametric relationships, improve interpretability, and achieve better performance on synthetic and UCI Liver Disorders datasets.


Endo-4DGS: Distilling Depth Ranking for Endoscopic Monocular Scene Reconstruction with 4D Gaussian Splatting

http://arxiv.org/abs/2401.16416v1

Compressor summary: Endo-4DGS is a real-time method for dynamic scene reconstruction in robot-assisted surgery using 4D Gaussian Splatting and monocular views, which improves surgical outcomes.


Learning to Manipulate under Limited Information

http://arxiv.org/abs/2401.16412v1

Compressor summary: The study uses neural networks to test the vulnerability of different voting methods to strategic manipulation under various conditions and finds significant differences in manipulability among them.


Scaling Sparse Fine-Tuning to Large Language Models

http://arxiv.org/abs/2401.16405v1

Compressor summary: The paper presents a sparse fine-tuning method that improves performance of large language models without increasing memory requirements, making it compatible with quantization and efficient optimizers.
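
As a rough illustration of the general sparse fine-tuning idea (the paper's actual selection and update scheme may differ), only a small, fixed set of weight positions carries trainable deltas while the dense base weights stay frozen, which is what keeps memory close to inference-level and plays well with quantized base weights. The SparseDelta module and its use of precomputed indices below are assumptions.

```python
import torch
import torch.nn as nn

class SparseDelta(nn.Module):
    """Trainable deltas for a small, fixed subset of a frozen weight matrix.

    `indices` are flat positions into the base weight chosen in advance
    (e.g. by a magnitude or gradient criterion; the selection rule is left open).
    Only `values` is trained and needs to be stored per task.
    """
    def __init__(self, base_weight, indices):
        super().__init__()
        self.register_buffer("base", base_weight.detach())     # frozen pretrained weight
        self.register_buffer("indices", indices)                # chosen flat positions
        self.values = nn.Parameter(torch.zeros(indices.numel()))

    def effective_weight(self):
        delta = torch.zeros(self.base.numel())
        delta = delta.scatter(0, self.indices, self.values)     # place deltas at chosen slots
        return self.base + delta.view_as(self.base)

    def forward(self, x):                                        # linear layer semantics
        return x @ self.effective_weight().t()
```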


ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text

http://arxiv.org/abs/2401.16403v1

Compressor summary: ViLexNorm is a corpus of Vietnamese social media sentences paired with their lexically normalized forms, supporting systems that transform non-standard words into their standard forms and improving various downstream NLP tasks.


A Survey on Visual Anomaly Detection: Challenge, Approach, and Prospect

http://arxiv.org/abs/2401.16402v1

Compressor summary: This paper surveys Visual Anomaly Detection (VAD), discussing its challenges and recent advancements in dealing with scarce, diverse, and complex data.


Zero-shot Imitation Policy via Search in Demonstration Dataset

http://arxiv.org/abs/2401.16398v1

Compressor summary:
Key points:
- Behavioral cloning uses demonstrations to learn a policy
- The authors propose using latent spaces of pre-trained models to find similar experiences and copy behavior
- The approach is tested on the MineRL dataset in the Minecraft environment
- The search-based approach outperforms learning-based models in accuracy and perceptual evaluation
Summary: The paper proposes a search-based method for behavioral cloning using latent spaces of pre-trained models, which achieves better results than learning-based models in Minecraft.
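
A minimal sketch of the retrieval step suggested by the key points: embed the current observation with a pre-trained encoder, find the most similar demonstration state in latent space, and reuse the action recorded there. The cosine similarity and single-action copying below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def nearest_demo_action(obs_embedding, demo_embeddings, demo_actions):
    """Return the action of the most similar demonstration state.

    obs_embedding:   latent vector of the current observation (from a pre-trained encoder)
    demo_embeddings: (N, D) latent vectors of demonstration observations
    demo_actions:    length-N list of actions recorded right after those observations
    """
    demo = demo_embeddings / np.linalg.norm(demo_embeddings, axis=1, keepdims=True)
    query = obs_embedding / np.linalg.norm(obs_embedding)
    best = int(np.argmax(demo @ query))     # cosine similarity search
    return demo_actions[best]

# At control time: encode the current frame, retrieve, execute, repeat.
```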


Amazon's 2023 Drought: Sentinel-1 Reveals Extreme Rio Negro River Contraction

http://arxiv.org/abs/2401.16393v1

Compressor summary:
Key points:
- The Amazon is facing a severe drought affecting the Rio Negro River
- Researchers used a U-net model to map water surfaces from Sentinel-1 satellite images with high accuracy
- The water surface reached its lowest level in November 2023, reduced by 68.1% compared to the maximum
Summary: Using deep learning and satellite data, researchers mapped the drastic decline of the Rio Negro River water surface in the Amazon drought.


Continual Learning with Pre-Trained Models: A Survey

http://arxiv.org/abs/2401.16386v1

Compressor summary: This paper surveys recent progress in using pre-trained models for continual learning, categorizes existing methods, and discusses fairness in comparisons.


Learning logic programs by finding minimal unsatisfiable subprograms

http://arxiv.org/abs/2401.16383v1

Compressor summary: The paper's ILP approach uses minimal unsatisfiable subprograms (MUSPs) to efficiently and soundly prune the search space, reducing learning times by up to 99%.


Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

http://arxiv.org/abs/2401.16380v1

Compressor summary: WRAP uses an instruction-tuned model to paraphrase documents in specific styles, speeding up pre-training and improving LLM performance on noisy data.


Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator

http://arxiv.org/abs/2401.16375v1

Compressor summary: The paper analyzes different layout generation methods, proposes a learning-based locator to detect errors in generated layouts using wireframe images, and shows improved results over existing approaches.


Bayesian optimization as a flexible and efficient design framework for sustainable process systems

http://arxiv.org/abs/2401.16373v1

Compressor summary: Bayesian optimization is a useful tool for optimizing complex functions in various fields, but there are still challenges and opportunities to make it more efficient and effective.


TQCompressor: improving tensor decomposition methods in neural networks via permutations

http://arxiv.org/abs/2401.16367v1

Compressor summary: TQCompressor compresses pre-trained language models like GPT-2 by using improved tensor decompositions and a training strategy that achieves better performance than other compression methods.


PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

http://arxiv.org/abs/2401.16355v1

Compressor summary: PathMMU is a high-quality benchmark for large multimodal models in pathology, but even advanced AI models struggle to match human expertise.


ConFit: Improving Resume-Job Matching using Data Augmentation and Contrastive Learning

http://arxiv.org/abs/2401.16349v1

Compressor summary:
Key points:
- A system that matches resumes and jobs using data augmentation and contrastive learning
- ConFit creates an augmented dataset by paraphrasing resume and job sections
- Outperforms prior methods in ranking jobs and resumes
Summary: ConFit is a system that uses data augmentation and contrastive learning to match resumes and jobs, creating an enhanced dataset and improving ranking performance.


Beyond Automated Evaluation Metrics: Evaluating Topic Models On Practical Social Science Content Analysis Tasks

http://arxiv.org/abs/2401.16348v1

Compressor summary: The paper compares different types of topic models in real-world content analysis and document annotation tasks, finding that some neural models perform better than classical ones despite questionable validity of automated evaluation metrics.


Cross-Modal Coordination Across a Diverse Set of Input Modalities

http://arxiv.org/abs/2401.16347v1

Compressor summary: The paper proposes two methods for cross-modal retrieval using multiple input modalities and shows their effectiveness in improving retrieval performance.


Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

http://arxiv.org/abs/2401.16335v1

Compressor summary: The paper proposes Iterative Data Smoothing (IDS), which updates the data and the reward model simultaneously during RLHF, mitigating reward overfitting and overoptimization so that language models better reflect human-centric values.
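
The summary only states that the data and the reward model are updated together, so the sketch below is one plausible reading: alternate between fitting the reward model on pairwise preference labels and smoothing those labels toward the model's current predictions. The smoothing rule, loss, and schedule are assumptions, not the paper's exact update.

```python
import torch

def iterative_data_smoothing(reward_model, pairs, labels, optimizer, beta=0.3, rounds=5):
    """Toy alternation between reward-model updates and label smoothing.

    pairs:  list of (chosen_features, rejected_features) tensors
    labels: 1-D tensor of soft probabilities that the first response is preferred
    beta:   how far labels move toward the model's predictions each round
    """
    labels = labels.clone()
    for _ in range(rounds):
        # 1) fit the reward model to the current (soft) preference labels
        for (a, b), y in zip(pairs, labels):
            p = torch.sigmoid(reward_model(a) - reward_model(b))
            loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # 2) smooth the labels toward what the model now predicts
        with torch.no_grad():
            preds = torch.tensor([torch.sigmoid(reward_model(a) - reward_model(b)).item()
                                  for a, b in pairs])
            labels = (1 - beta) * labels + beta * preds
    return labels
```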


Synthesis of 3D on-air signatures with the Sigma-Lognormal model

http://arxiv.org/abs/2401.16329v1

Compressor summary:
Key points:
- Signature synthesis is a technique to generate artificial signatures for verification
- The paper proposes a 3D framework based on the lognormality principle and neuromotor control processes
- The paper shows the synthesis of trajectories, velocities, duplicates, and air writing and gestures
- The paper demonstrates the performance and human likeness of the synthetic signatures
Summary: The paper presents a 3D signature synthesis framework that generates artificial specimens for verification, based on neuromotor control processes and the lognormality principle. It also shows how to synthesize trajectories, duplicates, air writing and gestures, and evaluates the quality and realism of the synthetic signatures.


PICL: Physics Informed Contrastive Learning for Partial Differential Equations

http://arxiv.org/abs/2401.16327v1

Compressor summary: The authors propose a novel method to improve neural operators' generalization across multiple governing PDEs using physics-informed contrastive pretraining.


Defining and Extracting generalizable interaction primitives from DNNs

http://arxiv.org/abs/2401.16318v1

Compressor summary: The paper describes a new method to extract generalizable interactions between input variables encoded by different deep neural networks trained for the same task, improving explainability in AI.


Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets

http://arxiv.org/abs/2401.16313v1

Compressor summary: ACES is a large challenge set that tests 50 machine translation metrics on 68 different types of errors across 146 languages, revealing their strengths and weaknesses, and suggesting improvements for metric design.


MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection

http://arxiv.org/abs/2401.16305v1

Compressor summary: MixSup is a 3D object detection method that uses cheap cluster labels for semantics and expensive box labels for poses and shapes, achieving high performance with reduced annotation effort.


Regressing Transformers for Data-efficient Visual Place Recognition

http://arxiv.org/abs/2401.16304v1

Compressor summary: The paper proposes a regression-based method for visual place recognition that improves ranking accuracy and efficiency by using camera field-of-view overlap as ground truth.


Enhancing Molecular Property Prediction with Auxiliary Learning and Task-Specific Adaptation

http://arxiv.org/abs/2401.16299v1

Compressor summary: The paper proposes methods to improve generalization of pretrained Graph Neural Networks for molecular property prediction by jointly training them with auxiliary tasks and adapting their weights or gradients.


Breaking the Barrier: Selective Uncertainty-based Active Learning for Medical Image Segmentation

http://arxiv.org/abs/2401.16298v1

Compressor summary: Selective Uncertainty-based Active Learning prioritizes pixels within target areas and near decision boundaries to improve medical image segmentation, outperforming conventional uncertainty-based methods.
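
The selection rule described here can be sketched as a scoring function: among unlabeled pixels, keep those that fall inside the predicted target region and whose foreground probability sits near the decision boundary, then query the most uncertain ones. The probability band and entropy-style score below are illustrative assumptions.

```python
import torch

def select_uncertain_target_pixels(prob_map, k, boundary_band=0.2):
    """Pick k pixels inside the predicted target whose probability is near 0.5.

    prob_map: (H, W) foreground probabilities from the current segmentation model
    boundary_band: half-width of the probability band around the decision boundary
    """
    entropy = -(prob_map * torch.log(prob_map + 1e-8)
                + (1 - prob_map) * torch.log(1 - prob_map + 1e-8))
    in_target = prob_map > 0.5
    near_boundary = (prob_map - 0.5).abs() < boundary_band
    score = entropy * (in_target & near_boundary).float()
    H, W = prob_map.shape
    flat_idx = torch.topk(score.flatten(), k).indices
    rows = torch.div(flat_idx, W, rounding_mode="floor")
    return torch.stack((rows, flat_idx % W), dim=1)   # (k, 2) pixel coordinates to annotate
```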


Dual feature-based and example-based explanation methods

http://arxiv.org/abs/2401.16294v1

Compressor summary: A new method for explaining machine learning models uses convex hulls and dual representations to generate examples and calculate feature importance values, improving on LIME.


Textual Entailment for Effective Triple Validation in Object Prediction

http://arxiv.org/abs/2401.16293v1

Compressor summary: The paper proposes using textual entailment to validate facts extracted from language models for knowledge base population, improving accuracy and reducing unintended or hallucinatory results.


MachineLearnAthon: An Action-Oriented Machine Learning Didactic Concept

http://arxiv.org/abs/2401.16291v1

Compressor summary: MachineLearnAthon is a new teaching concept that uses real-world ML challenges with industrial data sets to promote interdisciplinary and practical skills in students of various backgrounds.


GAPS: Geometry-Aware Problem Solver

http://arxiv.org/abs/2401.16287v1

Compressor summary: GAPS is a novel model that solves geometry math problems by generating solution programs as compositions of operators and operands, and outperforms Geoformer on calculation and proving tasks.


Capturing Pertinent Symbolic Features for Enhanced Content-Based Misinformation Detection

http://arxiv.org/abs/2401.16285v1

Compressor summary: The paper proposes a new method for detecting misleading content using linguistic features and symbolic knowledge, achieving state-of-the-art performance in various datasets.


Leveraging Positional Encoding for Robust Multi-Reference-Based Object 6D Pose Estimation

http://arxiv.org/abs/2401.16284v1

Compressor summary: The paper proposes new strategies to improve deep learning methods for estimating the pose of an object in computer vision and robotics by addressing their limitations.


MAPLE: Micro Analysis of Pairwise Language Evolution for Few-Shot Claim Verification

http://arxiv.org/abs/2401.16282v1

Compressor summary: The paper introduces MAPLE, a method for verifying claims using limited data and unlabelled pairwise data, which outperforms existing approaches in fact-checking tasks.


Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model

http://arxiv.org/abs/2401.16280v1

Compressor summary:
Key points:
- A large video understanding model is used for human fall detection on untrimmed video
- A pretrained vision transformer detects three classes of actions: Fall, Lying and Other/ADL
- A simple cutup method and a preprocessing pipeline are used to create labeled action clips
- The method achieves state-of-the-art results on the HQFSD dataset under real-time settings
Summary: The paper proposes a method that uses a video understanding model with a pretrained vision transformer to detect falls and other activities on untrimmed videos, and shows its effectiveness on a public dataset.


Capturing Knowledge Graphs and Rules with Octagon Embeddings

http://arxiv.org/abs/2401.16270v1

Compressor summary: The paper proposes using axis-aligned octagons for region-based knowledge graph embeddings to overcome limitations in modeling relational composition and rules.


CO2: Efficient Distributed Training with Full Communication-Computation Overlap

http://arxiv.org/abs/2401.16265v1

Compressor summary: CO2 is a new approach that allows large language models to be trained efficiently even on clusters with limited communication bandwidth by combining local updating, asynchronous communication, and advanced techniques for stability and convergence.


Towards Red Teaming in Multimodal and Multilingual Translation

http://arxiv.org/abs/2401.16247v1

Compressor summary: The paper explores human-based red teaming for machine translation, assessing model performance and reliability by generating edge cases that reveal critical errors.


Clinically meaningful timeline summarisation in social media for mental health monitoring

http://arxiv.org/abs/2401.16240v1

Compressor summary: The paper proposes a novel method for unsupervised abstractive summarization of social media user timelines for mental health monitoring, using a hierarchical variational autoencoder and a large language model.


Effective Communication with Dynamic Feature Compression

http://arxiv.org/abs/2401.16236v1

Compressor summary: The authors propose a system that uses DRL to optimize communication between an observer and a robot controller in a 5G wireless network, improving performance on a simulated task.


Player Pressure Map - A Novel Representation of Pressure in Soccer for Evaluating Player Performance in Different Game Contexts

http://arxiv.org/abs/2401.16235v1

Compressor summary: The paper introduces a player pressure map to visualize and evaluate the pressure experienced by soccer teams in game scenes, helping coaches improve players' performance under pressure.


Cross-Database Liveness Detection: Insights from Comparative Biometric Analysis

http://arxiv.org/abs/2401.16232v1

Compressor summary: The paper evaluates liveness detection models for biometric security and highlights the challenges and gaps in cross-database scenarios.


Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models

http://arxiv.org/abs/2401.16224v1

Compressor summary: The paper proposes Diffutoon, a toon shading method based on diffusion models that can render photorealistic videos in anime style and edit them according to prompts.


Learning big logical rules by joining small rules

http://arxiv.org/abs/2401.16215v1

Compressor summary: The paper proposes a method to combine small rules into large ones for inductive logic programming, using constraint solvers to improve accuracy on various domains like game playing and drug design.


MultiMUC: Multilingual Template Filling on MUC-4

http://arxiv.org/abs/2401.16209v1

Compressor summary: MultiMUC is a multilingual parallel corpus for template filling with translations into five languages and human annotations.


Geospatial Disparities: A Case Study on Real Estate Prices in Paris

http://arxiv.org/abs/2401.16197v1

Compressor summary:
Key points:
- Geospatial data is crucial for predictive models but may perpetuate historical biases and exclusionary practices
- The paper proposes a toolkit to identify and mitigate such biases, especially in ordinal regression with spatial attributes
- The paper illustrates the methodology using a Parisian real estate dataset and discusses the implications of geographical aggregation levels for fairness and calibration
Summary: The paper introduces a toolkit to address geospatial data biases in predictive models, especially in ordinal regression, and applies it to a Parisian real estate dataset, examining how different aggregation levels affect fairness and calibration.


Contributing Dimension Structure of Deep Feature for Coreset Selection

http://arxiv.org/abs/2401.16193v1

Compressor summary: The text proposes a novel feature-based diversity constraint for coreset selection that considers both similarity and contribution of dimensions, improving performance and diversity in deep learning.


FIMP: Future Interaction Modeling for Multi-Agent Motion Prediction

http://arxiv.org/abs/2401.16189v1

Compressor summary: FIMP is a novel method for autonomous driving that predicts multi-agent motions by capturing potential future interactions using feature-level decoding and affinity learning.


On the Semantics of LM Latent Space: A Vocabulary-defined Approach

http://arxiv.org/abs/2401.16184v1

Compressor summary: The text introduces a new method for analyzing language models' latent space that provides absolute and model-centric insights into their semantics, improving their performance and interpretability.


LLaMandement: Large Language Models for Summarization of French Legislative Proposals

http://arxiv.org/abs/2401.16182v1

Compressor summary: The French government created LLaMandement, a Large Language Model that generates neutral summaries of legislative proposals and helps process parliamentary sessions efficiently and effectively.


A Survey on Structure-Preserving Graph Transformers

http://arxiv.org/abs/2401.16176v1

Compressor summary: This paper provides a comprehensive overview of structure-preserving graph transformers and categorizes their strategies based on their design objectives and goals for preserving graph structures.


Constrained Bi-Level Optimization: Proximal Lagrangian Value function Approach and Hessian-free Algorithm

http://arxiv.org/abs/2401.16164v1

Compressor summary: The paper proposes a new Hessian-free algorithm for solving constrained Bi-Level Optimization problems in machine learning, using a smooth proximal Lagrangian value function to handle constraints.


LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

http://arxiv.org/abs/2401.16160v1

Compressor summary: The authors propose a sparse mixture of LoRA experts for instruction finetuning MLLMs to handle data conflicts from distinct domains and achieve better performance.


Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

http://arxiv.org/abs/2401.16158v1

Compressor summary:
Key points:
- Mobile-Agent is a vision-centric mobile device agent that can perform complex tasks on apps.
- It does not require system-specific customizations or XML files of apps.
- Mobile-Eval is a benchmark for evaluating Mobile-Agent, and the results are promising.
Summary: Mobile-Agent is a vision-centric mobile device agent that can navigate and operate apps without system-specific customizations, using Mobile-Eval to demonstrate its accuracy and versatility.


Spatial-Aware Latent Initialization for Controllable Image Generation

http://arxiv.org/abs/2401.16157v1

Compressor summary: Our approach uses a spatial-aware initialization noise to improve text-to-image generation by leveraging inverted reference images for layout control.


Divide and Conquer: Rethinking the Training Paradigm of Neural Radiance Fields

http://arxiv.org/abs/2401.16144v1

Compressor summary: The paper proposes a new training method for neural radiance fields (NeRFs) that improves rendering quality by dividing input views into groups based on visual similarity and training specialized models on each group, then aggregating their knowledge using distillation.


X-PEFT: eXtremely Parameter-Efficient Fine-Tuning for Extreme Multi-Profile Scenarios

http://arxiv.org/abs/2401.16137v1

Compressor summary: X-PEFT is a novel method that uses binary masks to select adapters efficiently for multiple profiles, outperforming conventional adapter tuning with much less memory usage.
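
Reading the summary literally, the saving comes from reusing a pool of already-trained adapters and learning only a per-profile binary mask that selects which of them to switch on. The sketch below relaxes the mask to sigmoid gates for training; the pooling, gating, and additive combination are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MaskedAdapterPool(nn.Module):
    """Combine a pool of frozen adapters through a (near-)binary selection mask."""

    def __init__(self, adapters):
        super().__init__()
        self.adapters = nn.ModuleList(adapters)           # frozen, pre-trained adapters
        for p in self.adapters.parameters():
            p.requires_grad = False
        # the only per-profile trainable state: one logit per adapter
        self.mask_logits = nn.Parameter(torch.zeros(len(adapters)))

    def forward(self, hidden):
        gates = torch.sigmoid(self.mask_logits)           # round to {0, 1} at inference
        out = hidden
        for gate, adapter in zip(gates, self.adapters):
            out = out + gate * adapter(hidden)
        return out
```

Under these assumptions, storing a new profile amounts to storing one bit per adapter on top of the shared pool, which is where the extreme parameter efficiency would come from.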


BooleanOCT: Optimal Classification Trees based on multivariate Boolean Rules

http://arxiv.org/abs/2401.16133v1

Compressor summary: The new mixed-integer programming approach improves the accuracy of optimal classification trees, especially for small and medium-sized datasets, while incorporating both linear and nonlinear metrics.


CIMIL-CRC: a clinically-informed multiple instance learning framework for patient-level colorectal cancer molecular subtypes classification from H&E stained images

http://arxiv.org/abs/2401.16131v1

Compressor summary: CIMIL-CRC, a deep neural network framework, uses clinical information to improve the classification of colorectal cancer subtypes from whole-slide images.


On the generalization of learned constraints for ASP solving in temporal domains

http://arxiv.org/abs/2401.16124v1

Compressor summary: The paper explores how to generalize dynamic constraints in ASP to improve temporal problem solving and evaluates the effect on solver performance.


DeFlow: Decoder of Scene Flow Network in Autonomous Driving

http://arxiv.org/abs/2401.16122v1

Compressor summary: DeFlow uses a GRU refinement module to transition from voxel-based to point-based features for scene flow estimation, improving performance and efficiency on large-scale point cloud data.


Triple Disentangled Representation Learning for Multimodal Affective Analysis

http://arxiv.org/abs/2401.16119v1

Compressor summary: TriDiRA is a novel approach that disentangles three types of modality-specific representations to improve multimodal learning for affective analysis tasks.


Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

http://arxiv.org/abs/2401.16110v1

Compressor summary: The paper proposes a framework for improving vision-based roadside detection in autonomous vehicles by mitigating background overfitting and generating diverse training data using unlabeled images, leading to better performance on new scenes.


Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis

http://arxiv.org/abs/2401.16107v1

Compressor summary: The text describes an AI framework that mimics real-world medical consultations to improve automatic diagnosis using large language models without much training time.


A 2D Sinogram-Based Approach to Defect Localization in Computed Tomography

http://arxiv.org/abs/2401.16104v1

Compressor summary: This paper introduces a three-step deep learning algorithm for detecting and analyzing defects in industrial computed tomography using sinograms instead of reconstructed images, achieving high accuracy and precision.


Flexible Parallel Neural Network Architecture Model for Early Prediction of Lithium Battery Life

http://arxiv.org/abs/2401.16102v1

Compressor summary: The paper proposes a flexible deep learning model (FPNN) that effectively predicts battery life using electrochemical features extracted from video-like data, achieving high accuracy and interpretability in different tasks.


Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You

http://arxiv.org/abs/2401.16092v1

Compressor summary: The paper introduces MAGBIG, a benchmark to study gender bias in multilingual text-to-image generation models, and finds that they deviate from normative assumptions and differ across languages.


Fairness in Algorithmic Recourse Through the Lens of Substantive Equality of Opportunity

http://arxiv.org/abs/2401.16088v1

Compressor summary: The paper proposes two fairness metrics for algorithmic recourse that consider time and effort, and tests them on a simulated recourse system.


High Resolution Image Quality Database

http://arxiv.org/abs/2401.16087v1

Compressor summary: The paper introduces HRIQ, a high-resolution image quality database for training blind image quality assessment models to accurately predict MOS of high-resolution images.


Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

http://arxiv.org/abs/2401.16086v1

Compressor summary: The paper shows that non-fluent target-side synthetic training samples can improve multilingual machine translation performance across different tasks and domains.


Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

http://arxiv.org/abs/2401.16078v1

Compressor summary: The paper investigates how word-level linguistic annotations affect under-resourced neural machine translation and finds that source-language annotations are generally helpful, while target-language annotations perform better with part-of-speech tags than morpho-syntactic description tags.


Find the Cliffhanger: Multi-Modal Trailerness in Soap Operas

http://arxiv.org/abs/2401.16076v1

Compressor summary: The paper presents a method to predict the most engaging moments for creating trailers using both visual and dialogue information, and tests it on a new soap opera dataset.


Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing

http://arxiv.org/abs/2401.16055v1

Compressor summary: In machine translation model stealing, an attacker can build a local model from the victim's outputs; the choice of vocabulary does not significantly affect performance, and the victim's vocabulary can be extracted from those outputs.


Dynamic Prototype Adaptation with Distillation for Few-shot Point Cloud Segmentation

http://arxiv.org/abs/2401.16051v1

Compressor summary: The paper proposes dynamic prototype adaptation (DPA), a method that learns task-specific prototypes for segmenting point clouds with minimal supervision, and outperforms existing methods on two benchmarks.


Type-based Neural Link Prediction Adapter for Complex Query Answering

http://arxiv.org/abs/2401.16045v1

Compressor summary: The paper introduces TENLPA, a novel model that uses type information in knowledge graphs to improve complex logical query answering by constructing type-based entity-relation graphs and adaptively adjusting neural link predictors.


Second Order Kinematic Surface Fitting in Anatomical Structures

http://arxiv.org/abs/2401.16035v1

Compressor summary: The paper proposes a second order velocity field method for kinematic surface fitting that improves symmetry detection and morphological classification in medical image analysis.


Domain adaptation strategies for 3D reconstruction of the lumbar spine using real fluoroscopy data

http://arxiv.org/abs/2401.16027v1

Compressor summary: The study presents a refined X23D model that generates accurate 3D spine reconstructions from few intraoperative fluoroscopic images, bridging the domain gap between synthetic and real data for improved surgical navigation in orthopedic surgeries.


Simple Policy Optimization

http://arxiv.org/abs/2401.16025v1

Compressor summary: The paper introduces SPO, a new algorithm that improves on PPO by using a better clipping method for KL divergence, which enhances stability and performance in reinforcement learning environments.


Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures

http://arxiv.org/abs/2401.16024v1

Compressor summary: The study proposes a method to learn rules in vector-symbolic architectures for solving Raven's progressive matrices, improving abstract reasoning in artificial intelligence and outperforming connectionist models.


Finding Challenging Metaphors that Confuse Pretrained Language Models

http://arxiv.org/abs/2401.16012v1

Compressor summary: The paper investigates challenging metaphors for NLP models and proposes an automatic pipeline to identify them, showing significant drops in performance on downstream tasks.


GPS: Graph Contrastive Learning via Multi-scale Augmented Views from Adversarial Pooling

http://arxiv.org/abs/2401.16011v1

Compressor summary: GPS is a new graph contrastive learning approach that uses graph pooling to generate multi-scale positive views, improving representation learning performance on graphs.


AccessLens: Auto-detecting Inaccessibility of Everyday Objects

http://arxiv.org/abs/2401.15996v1

Compressor summary: AccessLens is a system that uses machine learning and 3D printing to identify and solve everyday physical interface barriers for diverse people.


Deep Embedding Clustering Driven by Sample Stability

http://arxiv.org/abs/2401.15989v1

Compressor summary: DECS is a deep embedding clustering algorithm without pseudo targets that uses sample stability to pull samples to their clusters and outperforms existing methods.


Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling

http://arxiv.org/abs/2401.15987v1

Compressor summary: Our proposed data-driven method refines coarse hand motion for interacting with objects in virtual reality and robotics, using a hand-centric representation and a new hierarchical architecture.


Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

http://arxiv.org/abs/2401.15977v1

Compressor summary: Motion-I2V is a framework for generating videos from images that can handle large motion variations, support precise control over motion trajectories, and enable zero-shot video translation.


StableIdentity: Inserting Anybody into Anywhere at First Sight

http://arxiv.org/abs/2401.15975v1

Compressor summary: StableIdentity is a method that can insert a person's identity into various contexts using just one face image, achieving high-quality human-centric generation while preserving the identity.


Sample Weight Estimation Using Meta-Updates for Online Continual Learning

http://arxiv.org/abs/2401.15973v1

Compressor summary: The paper proposes OMSI, a method that adapts sample weights for each sample in a mini-batch to improve continual learning performance and accuracy.


Routers in Vision Mixture of Experts: An Empirical Study

http://arxiv.org/abs/2401.15969v1

Compressor summary: The paper studies different routers for Mixture-of-Experts models in computer vision tasks and finds that Expert Choice routers, soft MoEs, and adapting language model routers perform better.


Response Generation for Cognitive Behavioral Therapy with Large Language Models: Comparative Study with Socratic Questioning

http://arxiv.org/abs/2401.15966v1

Compressor summary: The study uses GPT-4 and OsakaED, two large language models, to generate counseling dialogues based on cognitive behavioral therapy scenarios, and finds that GPT-4 significantly improves mood, empathy, and other dialogue qualities compared to a scenario-based system.


Spatio-Temporal Attention Graph Neural Network for Remaining Useful Life Prediction

http://arxiv.org/abs/2401.15964v1

Compressor summary: The Spatio-Temporal Attention Graph Neural Network combines graph and temporal convolutional neural networks for predicting industrial system lifespans, improving precision and explainability with multi-head attention.


A Class-aware Optimal Transport Approach with Higher-Order Moment Matching for Unsupervised Domain Adaptation

http://arxiv.org/abs/2401.15952v1

Compressor summary: The paper introduces a new unsupervised domain adaptation method called class-aware optimal transport, which uses deep neural networks and higher-order moment matching to transfer knowledge from labeled source domain to unlabeled target domain.


TFDMNet: A Novel Network Structure Combines the Time Domain and Frequency Domain Features

http://arxiv.org/abs/2401.15949v1

Compressor summary: The paper proposes a new layer called EML to replace convolution layers in CNNs, reducing computation complexity and enabling parallelization, and introduces TFDMNet, a network structure that combines the advantages of both EMLs and convolution layers.


AdvNF: Reducing Mode Collapse in Conditional Normalising Flows using Adversarial Learning

http://arxiv.org/abs/2401.15948v1

Compressor summary: The paper explores issues with conditional normalising flows and suggests using adversarial training to improve them, testing the method on synthetic and real data.


MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

http://arxiv.org/abs/2401.15947v1

Compressor summary: The authors propose MoE-tuning, a training strategy to create sparse large vision-language models that can improve performance and reduce hallucinations with constant computational cost.


Bridging the Domain Gap: A Simple Domain Matching Method for Reference-based Image Super-Resolution in Remote Sensing

http://arxiv.org/abs/2401.15944v1

Compressor summary: The paper proposes a Domain Matching module to enhance reference-based image super-resolution models in real-world scenarios with domain gaps, such as satellite imaging, where existing models struggle.


Generating Multi-Center Classifier via Conditional Gaussian Distribution

http://arxiv.org/abs/2401.15942v1

Compressor summary: The paper proposes a novel multi-center linear classifier for image classification that samples multiple sub-centers from conditional Gaussian distributions to capture intra-class local structures more efficiently.
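
One way to picture the mechanism in this summary: each class keeps a learned Gaussian (mean and diagonal standard deviation) in feature space, several sub-centers are sampled from it, and the class logit comes from the best-matching sub-center. The sampling count, diagonal covariance, and max over sub-centers are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiCenterClassifier(nn.Module):
    """Toy multi-center linear classifier with Gaussian-sampled sub-centers."""

    def __init__(self, feat_dim, n_classes, n_centers=4):
        super().__init__()
        self.mean = nn.Parameter(torch.randn(n_classes, feat_dim) * 0.02)
        self.log_std = nn.Parameter(torch.full((n_classes, feat_dim), -2.0))
        self.n_centers = n_centers

    def forward(self, features):                            # features: (B, D)
        eps = torch.randn(self.n_centers, *self.mean.shape)
        centers = self.mean + self.log_std.exp() * eps      # (K, C, D) sampled sub-centers
        logits = torch.einsum("bd,kcd->bkc", features, centers)
        return logits.max(dim=1).values                     # (B, C): best sub-center per class
```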


Motion-induced error reduction for high-speed dynamic digital fringe projection system

http://arxiv.org/abs/2401.15938v1

Compressor summary: The paper proposes a method to reduce motion-induced errors in 3D shape measurement using fringe patterns by leveraging encoder data and camera-projector geometry, requiring low cost and easy implementation.


Self-Supervised Learning in Event Sequences: A Comparative Study and Hybrid Approach of Generative Modeling and Contrastive Learning

http://arxiv.org/abs/2401.15935v1

Compressor summary: The study explores combining generative and contrastive self-supervised learning techniques for event sequence representation and shows their universal benefits across various applications.


HICH Image/Text (HICH-IT): Comprehensive Text and Image Datasets for Hypertensive Intracerebral Hemorrhage Research

http://arxiv.org/abs/2401.15934v1

Compressor summary: The paper introduces HICH-IT, a new multimodal dataset for hypertensive intracerebral hemorrhage that includes textual information and head CT images to improve AI accuracy in diagnosis and treatment.


E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

http://arxiv.org/abs/2401.15927v1

Compressor summary: The text introduces E-EVAL, a benchmark for evaluating large language models in the Chinese K-12 education domain, which reveals their strengths and limitations across various subjects.


Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

http://arxiv.org/abs/2401.15914v1

Compressor summary: The paper proposes a novel approach called OGEN that improves out-of-distribution generalization of vision-language models by using a feature generator and an adaptive self-distillation mechanism to regularize the model.


Distribution-consistency Structural Causal Models

http://arxiv.org/abs/2401.15911v1

Compressor summary: The paper proposes Distribution-consistency Structural Causal Models (DiscoSCMs) to better model counterfactuals in causal frameworks, introducing a new parameter and theoretical results.


Toward the Identifiability of Comparative Deep Generative Models

http://arxiv.org/abs/2401.15903v1

Compressor summary: The authors propose a theory for identifying latent representations in comparative deep generative models, which can improve analysis of patterns between different data sets using piece-wise affine mixing functions and constrained optimization.


A Concise but Effective Network for Image Guided Depth Completion in Autonomous Driving

http://arxiv.org/abs/2401.15902v1

Compressor summary: CENet is a simple network that fuses RGB and depth images for autonomous driving using a fast guidance module, a decoupled prediction head, and achieves competitive performance on benchmarks.


MV2MAE: Multi-View Video Masked Autoencoders

http://arxiv.org/abs/2401.15900v1

Compressor summary:
Key points:
- The paper presents a self-supervised method for cross-view video reconstruction using a masked autoencoder (MAE) and a cross-attention mechanism
- The method leverages geometry information to improve robustness to viewpoint changes and temporal modeling of static regions
- The method achieves state-of-the-art results on various datasets
Summary: The paper proposes a self-supervised MAE framework for cross-view video reconstruction that uses geometry information and cross-attention to improve robustness and temporal modeling, and achieves state-of-the-art performance.


M^2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

http://arxiv.org/abs/2401.15896v1

Compressor summary: The text introduces a large bilingual dataset and a new pretraining method to improve vision-language models' understanding of images in both Chinese and English, achieving state-of-the-art results on multimodal tasks.


A Gated MLP Architecture for Learning Topological Dependencies in Spatio-Temporal Graphs

http://arxiv.org/abs/2401.15894v1

Compressor summary: Cy2Mixer is a novel spatio-temporal GNN that leverages non-trivial topological invariants and gated multi-layer perceptrons to capture spatial, temporal, and topological information in graphs for traffic analysis.


Arbitrary-Scale Downscaling of Tidal Current Data Using Implicit Continuous Representation

http://arxiv.org/abs/2401.15893v1

Compressor summary: The paper presents a novel downscaling framework for tidal current data that uses deep learning, addresses its unique characteristics, and reduces computational cost.


Grey Level Texture Features for Segmentation of Chromogenic Dye RNAscope From Breast Cancer Tissue

http://arxiv.org/abs/2401.15886v1

Compressor summary: The paper explores how gray level texture features can be used to automatically segment and classify RNAscope transcripts in breast cancer tissue, potentially improving the speed and accuracy of diagnosis and treatment.


Rectify the Regression Bias in Long-Tailed Object Detection

http://arxiv.org/abs/2401.15885v1

Compressor summary: The paper proposes solutions to improve long-tailed object detection by addressing the regression bias caused by class-specific regression heads for rare categories, achieving state-of-the-art results on LVIS dataset.


Corrective Retrieval Augmented Generation

http://arxiv.org/abs/2401.15884v1

Compressor summary: The paper proposes a method called CRAG to enhance the robustness of text generation by using web searches, confidence evaluation, and selective information extraction.


lil'HDoC: An Algorithm for Good Arm Identification under Small Threshold Gap

http://arxiv.org/abs/2401.15879v1

Compressor summary: lil'HDoC improves sample complexity for identifying good arms with small threshold gaps compared to HDoC and other algorithms.


Combining Satellite and Weather Data for Crop Type Mapping: An Inverse Modelling Approach

http://arxiv.org/abs/2401.15875v1

Compressor summary: WSTATT is a deep learning model that combines weather and satellite data to create accurate and early crop maps, accounting for physical processes of crop growth.


A Deep Q-Network Based on Radial Basis Functions for Multi-Echelon Inventory Management

http://arxiv.org/abs/2401.15872v1

Compressor summary: The paper proposes a Q-network based on radial basis functions for deep reinforcement learning to solve complex inventory management problems, demonstrating better performance than simple methods and current DRL approaches.
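
The summary indicates the Q-network's hidden representation is built from radial basis functions of the state. A minimal version of such a network, with learned centers and widths (which may or may not match the paper's design), looks like this.

```python
import torch
import torch.nn as nn

class RBFQNetwork(nn.Module):
    """Q-network whose hidden layer applies radial basis functions to the state.

    Hidden unit j computes exp(-||s - c_j||^2 / (2 * sigma_j^2)); a linear layer
    then maps these activations to one Q-value per action.
    """
    def __init__(self, state_dim, n_actions, n_rbf=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_rbf, state_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_rbf))
        self.out = nn.Linear(n_rbf, n_actions)

    def forward(self, state):                                       # state: (B, state_dim)
        dist2 = ((state.unsqueeze(1) - self.centers) ** 2).sum(-1)  # (B, n_rbf)
        phi = torch.exp(-dist2 / (2 * self.log_sigma.exp() ** 2))
        return self.out(phi)                                        # (B, n_actions)
```

Such a network can be dropped into a standard DQN training loop in place of the usual MLP; how the inventory state and order quantities are encoded is left to the environment.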


Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution

http://arxiv.org/abs/2401.15866v1

Compressor summary: The authors explore using noisy labels to train networks for explainable machine learning tasks, resulting in faster and more cost-effective approximations compared to exact label training.


LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection

http://arxiv.org/abs/2401.15865v1

Compressor summary: The paper proposes LiDAR-PTQ, a Post-Training Quantization method for 3D lidar detection tasks that achieves state-of-the-art performance and speedup while being cost-effective.


Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression

http://arxiv.org/abs/2401.15864v1

Compressor summary: The paper proposes a spatial and temporal method to improve learned video compression by handling motion inconsistency and occlusion in local regions.


Importance-Aware Adaptive Dataset Distillation

http://arxiv.org/abs/2401.15863v1

Compressor summary: The paper introduces IADD, a method that improves dataset distillation by assigning importance weights to network parameters, resulting in more robust and generalizable distilled datasets.
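
Taking the summary at face value, the key ingredient is an importance weight attached to each network parameter (or parameter group) inside the distillation matching objective. The sketch below shows a generic importance-weighted parameter-matching term; the weighting scheme and the squared-difference criterion are assumptions, not IADD's exact formulation.

```python
import torch

def weighted_param_matching_loss(params_synth, params_real, importance):
    """Importance-weighted matching between two sets of network parameters.

    params_synth / params_real: matching lists of parameter tensors from networks
        trained on the distilled vs. the real data
    importance: one scalar weight per parameter tensor
    """
    loss = torch.zeros(())
    for w, ps, pr in zip(importance, params_synth, params_real):
        loss = loss + w * ((ps - pr) ** 2).sum()
    return loss
```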


DrBERT: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining

http://arxiv.org/abs/2401.15861v1

Compressor summary: The paper proposes DrBERT, a method that improves BERT's encoder by enhancing its decoder, leading to better natural language processing performance without increasing inference time or serving costs.


Diffusion Facial Forgery Detection

http://arxiv.org/abs/2401.15859v1

Compressor summary: DiFF is a new dataset with over 500,000 face-focused diffusion-generated images that aim to study and improve forgery detection methods.


Look Around! Unexpected gains from training on environments in the vicinity of the target

http://arxiv.org/abs/2401.15856v1

Compressor summary: The paper introduces a Noise Injection method to study how Reinforcement Learning agents generalize when state transition probabilities change slightly in similar environments.


Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing

http://arxiv.org/abs/2401.15855v1

Compressor summary: The paper introduces Cross-Scale MAE, a self-supervised model for remote sensing image analysis that uses scale augmentation and cross-scale consistency constraints to learn consistent and meaningful representations.


LSTM-based Deep Neural Network With A Focus on Sentence Representation for Sequential Sentence Classification in Medical Scientific Abstracts

http://arxiv.org/abs/2401.15854v1

Compressor summary: The paper introduces a hierarchical deep learning model for the Sequential Sentence Classification task in medical abstracts, using sentence embeddings from an LSTM network and further enhancements from a C-RNN and MLP.


Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

http://arxiv.org/abs/2401.15847v1

Compressor summary: The paper introduces Multipanel Visual Question Answering, a benchmark that challenges models in comprehending multipanel images with 6,600 questions and answers, revealing the limitations of current large vision language models.


Meta-Learning for Neural Network-based Temporal Point Processes

http://arxiv.org/abs/2401.15846v1

Compressor summary: The paper proposes a meta-learning approach using recurrent neural networks and monotonic neural networks to predict human activity events from short sequences, taking into account temporal periodic patterns.


LCVO: An Efficient Pretraining-Free Framework for Visual Question Answering Grounding

http://arxiv.org/abs/2401.15842v1

Compressor summary: The paper introduces a modular method using a large language model to improve Visual Question Answering Grounding tasks with low computational resources and various pre-trained models.


2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

http://arxiv.org/abs/2401.15841v1

Compressor summary:
Key points:
- The paper proposes a novel framework for 3D object reconstruction from a single image using multi-view images.
- The framework addresses inconsistent lighting, misaligned geometry, and sparse views using three techniques: intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation.
- The framework achieves superior performance compared to the state-of-the-art method Syncdreamer.
Summary: The paper presents a new 3D reconstruction framework that uses multi-view images and tackles lighting, geometry, and view issues with three guidelines and fusion strategies, outperforming the current best method.


Emergent Explainability: Adding a causal chain to neural network inference

http://arxiv.org/abs/2401.15840v1

Compressor summary: The paper proposes a new framework to make AI more transparent and interpretable by integrating emergent communication into AI models, enabling causal understanding of outputs.


Few and Fewer: Learning Better from Few Examples Using Fewer Base Classes

http://arxiv.org/abs/2401.15834v1

Compressor summary: This paper explores how using fewer base classes for feature extraction can improve few-shot learning and presents simple, intuitive methods that can be applied to any few-shot solution.


Knowledge-Aware Neuron Interpretation for Scene Classification

http://arxiv.org/abs/2401.15820v1

Compressor summary: The text proposes a novel framework for interpreting neural models in image scene classification using knowledge graph-based methods, addressing limitations in concept completeness, fusion, and manipulation.


OntoMedRec: Logically-Pretrained Model-Agnostic Ontology Encoders for Medication Recommendation

http://arxiv.org/abs/2401.15814v1

Compressor summary: OntoMedRec is a model that uses medical ontologies to improve medication recommendation by addressing data sparsity issues in electronic health records.