arxiv compressed, 2023-12-05

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2023-12-05, generated by the compressor, my personal LLM-based project.


PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

Anh-Quan Cao,Angela Dai,Raoul de Charette

http://arxiv.org/abs/2312.02158v1

Compressor summary: The paper proposes a new task called Panoptic Scene Completion (PSC) that adds instance-level information to the Semantic Scene Completion (SSC) task, and introduces a method to estimate uncertainties using a multi-input multi-output strategy.


Mesh-Guided Neural Implicit Field Editing

Can Wang,Mingming He,Menglei Chai,Dongdong Chen,Jing Liao

http://arxiv.org/abs/2312.02157v1

Compressor summary: The paper proposes a new approach that combines mesh guiding and neural implicit fields for easy editing of 3D scenes while maintaining high-quality rendering.


GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng,Boyao Zhou,Ruizhi Shao,Boning Liu,Shengping Zhang,Liqiang Nie,Yebin Liu

http://arxiv.org/abs/2312.02155v1

Compressor summary: The GPS-Gaussian method allows real-time, high-resolution view synthesis of characters without fine-tuning or optimization by using Gaussian parameter maps and training on human scan data.


Latent Feature-Guided Diffusion Models for Shadow Removal

Kangfu Mei,Luis Figueroa,Zhe Lin,Zhihong Ding,Scott Cohen,Vishal M. Patel

http://arxiv.org/abs/2312.02156v1

Compressor summary: The paper proposes using diffusion models to gradually refine shadow regions and improve texture recovery, conditioning on a learned latent feature space and fusing noise features with the network, achieving significant improvements over previous methods.


Aligning and Prompting Everything All at Once for Universal Visual Perception

Yunhang Shen,Chaoyou Fu,Peixian Chen,Mengdan Zhang,Ke Li,Xing Sun,Yunsheng Wu,Shaohui Lin,Rongrong Ji

http://arxiv.org/abs/2312.02153v1

Compressor summary: The paper introduces APE, a universal visual perception model that performs diverse tasks like detection, segmentation, and grounding in images using instance-level sentence-object matching and without task-specific fine-tuning.


Steerers: A framework for rotation equivariant keypoint descriptors

Georg Bökman,Johan Edstedt,Michael Felsberg,Fredrik Kahl

http://arxiv.org/abs/2312.02152v1

Compressor summary: The authors propose a method to learn a linear transform called a steerer that encodes rotations in image keypoint descriptions, improving their robustness to camera rotation without sacrificing performance or runtime.


Guarding Barlow Twins Against Overfitting with Mixed Samples

Wele Gedara Chaminda Bandara,Celso M. De Melo,Vishal M. Patel

http://arxiv.org/abs/2312.02151v1

Compressor summary: Mixed Barlow Twins is a method to improve self-supervised learning by enhancing sample interaction and reducing feature overfitting, leading to better downstream task performance.


Readout Guidance: Learning Control from Diffusion Features

Grace Luo,Trevor Darrell,Oliver Wang,Dan B Goldman,Aleksander Holynski

http://arxiv.org/abs/2312.02150v1

Compressor summary: Readout Guidance is a method that uses lightweight networks to guide text-to-image diffusion models with user-defined targets, requiring fewer parameters and training samples than prior methods.


Generative Powers of Ten

Xiaojuan Wang,Janne Kontkanen,Brian Curless,Steve Seitz,Ira Kemelmacher,Ben Mildenhall,Pratul Srinivasan,Dor Verbin,Aleksander Holynski

http://arxiv.org/abs/2312.02149v1

Compressor summary: The text describes a method for creating images that can be zoomed in on extensively while maintaining consistency across different scales using a joint multi-scale diffusion sampling approach.


Rejuvenating image-GPT as Strong Visual Representation Learners

Sucheng Ren,Zeyu Wang,Hongru Zhu,Junfei Xiao,Alan Yuille,Cihang Xie

http://arxiv.org/abs/2312.02147v1

Compressor summary: The paper presents D-iGPT, a modified version of image-GPT that uses semantic tokens and predicts both visible and hidden tokens, achieving strong visual representation learning results on ImageNet-1K and other tasks.


Learning Polynomial Problems with $SL(2,\mathbb{R})$ Equivariance

Hannah Lawrence,Mitchell Tong Harris

http://arxiv.org/abs/2312.02146v1

Compressor summary: Neural networks can solve positivity optimization and certification problems for polynomials quickly and accurately, using data-driven methods that adapt to non-compact group equivariance structures.


Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke,Anton Obukhov,Shengyu Huang,Nando Metzger,Rodrigo Caye Daudt,Konrad Schindler

http://arxiv.org/abs/2312.02145v1

Compressor summary: Marigold is a method for monocular depth estimation that uses generative diffusion models to improve generalization and achieve state-of-the-art performance with synthetic training data.


Optimizing Camera Configurations for Multi-View Pedestrian Detection

Yunzhong Hou,Xingjian Leng,Tom Gedeon,Liang Zheng

http://arxiv.org/abs/2312.02144v1

Compressor summary: The text describes a novel solution that uses reinforcement learning and transformers to autonomously generate camera configurations for multi-view pedestrian detection systems, achieving better results than existing methods.


Competition-Level Problems Are Effective Evaluators of LLMs

Yiming Huang,Zhenghao Lin,Xiao Liu,Yeyun Gong,Shuai Lu,Fangyu Lei,Yaobo Liang,Yelong Shen,Chen Lin,Nan Duan,Weizhu Chen

http://arxiv.org/abs/2312.02143v1

Compressor summary: The paper evaluates GPT-4's reasoning skills on Codeforces problems, finding a sharp decline in performance on problems released after September 2021; this suggests data contamination and shows that competition-level problems remain challenging for existing LLMs.


Object Recognition as Next Token Prediction

Kaiyu Yue,Bor-Chun Chen,Jonas Geiping,Hengduo Li,Tom Goldstein,Ser-Nam Lim

http://arxiv.org/abs/2312.02142v1

Compressor summary: The paper proposes an efficient object recognition method using a language decoder to predict text tokens from image embeddings, with a custom attention mask and one-shot sampling for parallel label generation.


iMatching: Imperative Correspondence Learning

Zitong Zhan,Dasong Gao,Yun-Jou Lin,Youjie Xia,Chen Wang

http://arxiv.org/abs/2312.02141v1

Compressor summary: The paper introduces imperative learning (IL), a self-supervised scheme for training feature correspondence in computer vision, which improves performance on tasks like feature matching and pose estimation.


DiffiT: Diffusion Vision Transformers for Image Generation

Ali Hatamizadeh,Jiaming Song,Guilin Liu,Jan Kautz,Arash Vahdat

http://arxiv.org/abs/2312.02139v1

Compressor summary: This paper proposes Diffusion Vision Transformers (DiffiT), a hybrid hierarchical architecture with a U-shaped encoder and decoder, that uses time-dependent self-attention for efficient denoising in diffusion-based generative learning, achieving state-of-the-art results on various image synthesis tasks.


MANUS: Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians

Chandradeep Pokhariya,Ishaan N Shah,Angela Xing,Zekun Li,Kefan Chen,Avinash Sharma,Srinath Sridhar

http://arxiv.org/abs/2312.02137v1

Compressor summary: MANUS is a novel method for capturing hand-object grasps using articulated 3D Gaussians, which enables accurate estimation of contacts between hands and objects from tens of camera views.


BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

Qihang Zhang,Yinghao Xu,Yujun Shen,Bo Dai,Bolei Zhou,Ceyuan Yang

http://arxiv.org/abs/2312.02136v1

Compressor summary: The paper proposes a new way to generate large-scale 3D scenes using an equivariant radiance field and a bird's-eye view map, which allows for easy manipulation of objects and smooth stitching of local scenes.


Fast View Synthesis of Casual Videos

Yao-Chih Lee,Zhoutong Zhang,Kevin Blackburn-Matzen,Simon Niklaus,Jianming Zhang,Jia-Bin Huang,Feng Liu

http://arxiv.org/abs/2312.02135v1

Compressor summary: The paper proposes a fast and efficient method to synthesize novel views from monocular videos using explicit video representations, treating static and dynamic content separately.


GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu,Hongwen Zhang,Yuxiang Zhang,Boyao Zhou,Boning Liu,Shengping Zhang,Liqiang Nie

http://arxiv.org/abs/2312.02134v1

Compressor summary: GaussianAvatar creates realistic human avatars with dynamic 3D appearances from a single video using animatable 3D Gaussians and pose-dependent appearance modeling.


Style Aligned Image Generation via Shared Attention

Amir Hertz,Andrey Voynov,Shlomi Fruchter,Daniel Cohen-Or

http://arxiv.org/abs/2312.02133v1

Compressor summary: StyleAligned is a technique for maintaining consistent style across generated images using text-to-image models by sharing attention during the diffusion process.


Hot PATE: Private Aggregation of Distributions for Diverse Task

Edith Cohen,Xin Lyu,Jelani Nelson,Tamas Sarlos,Uri Stemmer

http://arxiv.org/abs/2312.02132v1

Compressor summary: Hot PATE is a method for transferring knowledge from multiple teacher models to a student model while preserving privacy, focusing on diverse tasks where there may not be a clear label for each example.


Can we truly transfer an actor's genuine happiness to avatars? An investigation into virtual, real, posed and spontaneous faces

Vitor Miguel Xavier Peres,Greice Pinho Dal Molin,Soraia Raupp Musse

http://arxiv.org/abs/2312.02128v1

Compressor summary: The research evaluates Ekman's action units in real and virtual human faces, both posed and spontaneous, to identify differences and similarities useful across several fields.


SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM

Nikhil Keetha,Jay Karhade,Krishna Murthy Jatavallabhula,Gengshan Yang,Sebastian Scherer,Deva Ramanan,Jonathon Luiten

http://arxiv.org/abs/2312.02126v1

Compressor summary: The paper introduces SplaTAM, a method that uses 3D Gaussians to enable dense SLAM with a single unposed monocular RGB-D camera, improving performance in pose estimation, map construction, and novel-view synthesis while allowing real-time rendering.


TPPoet: Transformer-Based Persian Poem Generation using Minimal Data and Advanced Decoding Techniques

Amir Panahandeh,Hanie Asemi,Esmail Nourani

http://arxiv.org/abs/2312.02125v1

Compressor summary: The study trains a Persian classical poetry generation model using a transformer architecture on a specialized dataset without pretraining, and proposes a novel decoding method to enhance coherence and meaningfulness in the generated poetry.


VerA: Versatile Anonymization Fit for Clinical Facial Images

Majed El Helou,Doruk Cetin,Petar Stamenkovic,Fabio Zund

http://arxiv.org/abs/2312.02124v1

Compressor summary: VerA is a versatile facial image anonymization method that preserves semantic areas for medical intervention and works well on before-and-after results, outperforming or matching existing methods in regular images.


Magicoder: Source Code Is All You Need

Yuxiang Wei,Zhe Wang,Jiawei Liu,Yifeng Ding,Lingming Zhang

http://arxiv.org/abs/2312.02120v1

Compressor summary: Magicoder is an open-source LLM for code that uses OSS-Instruct to generate diverse and realistic instruction data from open-source code snippets, resulting in high performance on coding benchmarks.


Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

Anay Mehrotra,Manolis Zampetakis,Paul Kassianik,Blaine Nelson,Hyrum Anderson,Yaron Singer,Amin Karbasi

http://arxiv.org/abs/2312.02119v1

Compressor summary: The paper introduces TAP, an automated method that uses tree-of-thoughts reasoning to generate jailbreaks for LLMs with only black-box access and minimal queries, outperforming previous methods.


When it Rains, it Pours: Modeling Media Storms and the News Ecosystem

Benjamin Litterer,David Jurgens,Dallas Card

http://arxiv.org/abs/2312.02118v1

Compressor summary: The authors develop a method to identify media storms from online news articles and study their characteristics and effects on media coverage and agenda setting.


GIVT: Generative Infinite-Vocabulary Transformers

Michael Tschannen,Cian Eastwood,Fabian Mentzer

http://arxiv.org/abs/2312.02116v1

Compressor summary: The paper introduces GIVT, transformers that generate sequences of real-valued vectors rather than tokens from a finite vocabulary, and shows their applications in image generation and other tasks.


TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology

Lucas Farndale,Robert Insall,Ke Yuan

http://arxiv.org/abs/2312.02111v1

Compressor summary: TriDeNT is a self-supervised method that uses additional data unavailable during inference to improve computational pathology models' performance on various tasks.


ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation

Dar-Yen Chen,Hamish Tennent,Ching-Wen Hsu

http://arxiv.org/abs/2312.02109v1

Compressor summary: ArtAdapter is a new text-to-image framework that transfers high-level artistic elements from text to image with unprecedented fidelity and no content borrowing, outperforming existing methods.


Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection

Sunghun Kang,Junbum Cha,Jonghwan Mun,Byungseok Roh,Chang D. Yoo

http://arxiv.org/abs/2312.02103v1

Compressor summary: The paper presents PLAC, a method to learn image-to-text mapping for arbitrary concepts in open-vocabulary object detection.


Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios

Jan Mielniczuk,Adam Wawrzeńczyk

http://arxiv.org/abs/2312.02095v1

Compressor summary: The paper compares different classifiers for positive unlabeled data in case-control and single-sample scenarios and shows that their performance can vary significantly depending on the scenario.
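
The two sampling schemes contrasted here can be illustrated with a toy sketch. This is a simplified illustration of the standard PU-learning setups, not the paper's experimental protocol; the function names, label frequency, and toy population are assumptions:

```python
import random

def single_sample_pu(population, label_frequency, rng):
    """Single-sample scenario: draw one sample from the population, then
    reveal the label of each positive with probability c (the label
    frequency); everything else stays unlabeled."""
    labeled, unlabeled = [], []
    for x, y in population:
        if y == 1 and rng.random() < label_frequency:
            labeled.append(x)
        else:
            unlabeled.append(x)
    return labeled, unlabeled

def case_control_pu(population, n_labeled, n_unlabeled, rng):
    """Case-control scenario: labeled positives and unlabeled examples are
    two independent samples -- positives drawn from the positive class,
    unlabeled examples drawn from the whole population."""
    positives = [x for x, y in population if y == 1]
    labeled = [rng.choice(positives) for _ in range(n_labeled)]
    unlabeled = [rng.choice(population)[0] for _ in range(n_unlabeled)]
    return labeled, unlabeled

rng = random.Random(0)
# toy population: 1-D feature, positives cluster near 1.0, negatives near 0.0
population = [(rng.gauss(1.0, 0.3), 1) for _ in range(500)] + \
             [(rng.gauss(0.0, 0.3), 0) for _ in range(500)]

ss_labeled, ss_unlabeled = single_sample_pu(population, 0.3, rng)
cc_labeled, cc_unlabeled = case_control_pu(population, 150, 850, rng)
```

A classifier trained on `(labeled, unlabeled)` pairs sees different distributions under the two schemes, which is why its performance can vary between scenarios.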


Physics simulation capabilities of LLMs

Mohamad Ali-Dib,Kristen Menou

http://arxiv.org/abs/2312.02091v1

Compressor summary: The paper tests the ability of large language models to solve PhD-level computational physics problems, evaluating them on 50 original problems across different domains and code packages; GPT-4 fails most of them but produces mostly correct lines with some physics and coding errors.


VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Yuchao Gu,Yipin Zhou,Bichen Wu,Licheng Yu,Jia-Wei Liu,Rui Zhao,Jay Zhangjie Wu,David Junhao Zhang,Mike Zheng Shou,Kevin Tang

http://arxiv.org/abs/2312.02087v1

Compressor summary: The VideoSwap framework uses semantic point correspondences instead of dense correspondences for shape-preserving video subject swapping with user-friendly interactions.


Deep Set Neural Networks for forecasting asynchronous bioprocess timeseries

Maxim Borisyak,Stefan Born,Peter Neubauer,Nicolás Cruz-Bournazou

http://arxiv.org/abs/2312.02079v1

Compressor summary: The authors propose a deep learning method that can handle sparse and irregular time series from bio-process data without needing imputation or alignment procedures, and demonstrate its effectiveness in forecasting tasks.


A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

Giovanni Monea,Maxime Peyrard,Martin Josifoski,Vishrav Chaudhary,Jason Eisner,Emre Kıcıman,Hamid Palangi,Barun Patra,Robert West

http://arxiv.org/abs/2312.02073v1

Compressor summary: The paper introduces Fakepedia, a dataset to study how large language models ground their knowledge in contradictory situations, and analyzes the differences between GPT-4-turbo and Mistral-7B in this context.


GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

Shenhan Qian,Tobias Kirschstein,Liam Schoneveld,Davide Davoli,Simon Giebenhain,Matthias Nießner

http://arxiv.org/abs/2312.02069v1

Compressor summary: GaussianAvatars is a new method for creating realistic and controllable 3D head avatars using Gaussian splats and a parametric morphable face model.


Know Your Audience: Do LLMs Adapt to Different Age and Education Levels?

Donya Rooein,Amanda Cercas Curry,Dirk Hovy

http://arxiv.org/abs/2312.02065v1

Compressor summary: The paper evaluates the readability of answers generated by large language models (LLMs) for science questions targeting different age groups and education levels, finding that LLMs need improvement to better adapt to diverse audiences in educational settings.


DUCK: Distance-based Unlearning via Centroid Kinematics

Marco Cotogni,Jacopo Bonato,Luigi Sabetta,Francesco Pelosin,Alessandro Nicolosi

http://arxiv.org/abs/2312.02052v1

Compressor summary: DUCK is a new unlearning algorithm that uses metric learning to remove specific samples and achieve state-of-the-art performance in ensuring privacy in AI models.


TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren,Linli Yao,Shicheng Li,Xu Sun,Lu Hou

http://arxiv.org/abs/2312.02051v1

Compressor summary: TimeChat is a multimodal language model that can understand long videos by processing their visual and temporal information and following instructions from users.


GFS: Graph-based Feature Synthesis for Prediction over Relational Databases

Han Zhang,Quan Gan,David Wipf,Weinan Zhang

http://arxiv.org/abs/2312.02037v1

Compressor summary: The paper describes Graph-based Feature Synthesis (GFS), a novel framework that uses heterogeneous graphs to train machine learning models on multi-table relational databases without feature engineering, preserving the data's inherent relationships and structure.


Implicit Learning of Scene Geometry from Poses for Global Localization

Mohammad Altillawi,Shile Li,Sai Manoj Prakhya,Ziyuan Liu,Joan Serrat

http://arxiv.org/abs/2312.02029v1

Compressor summary: The paper proposes a learning method for global visual localization that uses minimal pose labels to learn 3D scene geometry and improve pose estimation accuracy using rigid alignment and additional learning constraints.


VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation

Christoph Hümmer,Manuel Schwonberg,Liangwei Zhong,Hu Cao,Alois Knoll,Hanno Gottschalk

http://arxiv.org/abs/2312.02021v1

Compressor summary: VLTSeg is a vision-language method that improves domain generalization in semantic segmentation by using CLIP and EVA-CLIP encoders, outperforming previous approaches on several benchmarks.


Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models

Xingyuan Zhang,Philip Becker-Ehmck,Patrick van der Smagt,Maximilian Karl

http://arxiv.org/abs/2312.02019v1

Compressor summary: The paper introduces AIME, a method that enables agents to learn new behaviors by imitating expert demonstrations without needing further training or environment interactions.


ColonNeRF: Neural Radiance Fields for High-Fidelity Long-Sequence Colonoscopy Reconstruction

Yufei Shi,Beijia Lu,Jia-Wei Liu,Ming Li,Mike Zheng Shou

http://arxiv.org/abs/2312.02015v1

Compressor summary: The ColonNeRF framework uses neural rendering to reconstruct the entire colon in a piecewise manner, addressing challenges like shape dissimilarity, geometry complexity, and sparse views for accurate long-sequence colonoscopy reconstruction.


Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization

M. R. Mahani,Igor A. Nechepurenko,Yasmin Rahimof,Andreas Wicht

http://arxiv.org/abs/2312.02012v1

Compressor summary: The text proposes a method for creating a minimal yet informative database using Bayesian optimization and Gaussian process regression to train accurate machine learning models with less data points.


Towards Learning a Generalist Model for Embodied Navigation

Duo Zheng,Shijia huang,Lin Zhao,Yiwu Zhong,Liwei Wang

http://arxiv.org/abs/2312.02010v1

Compressor summary: The paper introduces NaviLLM, a generalist AI model for embodied navigation that adapts large language models to various tasks using schema-based instructions and achieves state-of-the-art performance and generalizability.


Language-only Efficient Training of Zero-shot Composed Image Retrieval

Geonmo Gu,Sanghyuk Chun,Wonjae Kim,Yoohoon Kang,Sangdoo Yun

http://arxiv.org/abs/2312.01998v1

Compressor summary: The proposed LinCIR framework trains a composed image retrieval model using only text datasets with a self-supervision technique called self-masking projection, achieving high performance on four benchmarks and outperforming some supervised methods.


A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

Jungwon Choi,Seongho Keum,EungGu Yun,Byung-Hoon Kim,Juho Lee

http://arxiv.org/abs/2312.01994v1

Compressor summary: The authors propose a generative self-supervised learning method for graph neural networks to improve accuracy and interpretability in modeling dynamic functional connectivity from fMRI data, addressing challenges such as high data cost and limited generalization.


Information Modified K-Nearest Neighbor

Mohammad Ali Vahedifar,Azim Akhtarshenas,Mariam Sabbaghian,Mohammad Rafatpanah

http://arxiv.org/abs/2312.01991v1

Compressor summary: The paper introduces IMKNN, a novel method that improves the KNN algorithm by using Mutual Information and Shapley values to assign weights to neighbors, and shows its superior performance in various classification tasks.
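
As a rough illustration of the neighbor-weighting idea, here is a generic distance-weighted KNN sketch. IMKNN's actual mutual-information and Shapley-value weights are not reproduced; the inverse-distance weight below only marks where such weights would plug in:

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k=3):
    """Distance-weighted KNN: nearer neighbors cast larger votes.
    IMKNN would replace the inverse-distance weight with weights derived
    from mutual information and Shapley values."""
    # sort training points by Euclidean distance to the query
    dists = sorted((math.dist(x, query), y) for x, y in train)
    votes = defaultdict(float)
    for d, y in dists[:k]:
        votes[y] += 1.0 / (d + 1e-9)  # weight = inverse distance
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
print(weighted_knn_predict(train, (0.2, 0.1)))  # -> a
```

With k=3 the query's third-nearest neighbor is a "b", but the two much closer "a" points dominate the weighted vote, which plain majority voting would decide far less decisively.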


Bootstrapping SparseFormers from Vision Foundation Models

Ziteng Gao,Zhan Tong,Kevin Qinghong Lin,Joya Chen,Mike Zheng Shou

http://arxiv.org/abs/2312.01987v1

Compressor summary: The paper proposes a method to bootstrap SparseFormer architectures from ViT-based vision foundation models, reducing computational costs and enabling zero-shot performance with minimal training samples.


UniGS: Unified Representation for Image Generation and Segmentation

Lu Qi,Lehan Yang,Weidong Guo,Yu Xu,Bo Du,Varun Jampani,Ming-Hsuan Yang

http://arxiv.org/abs/2312.01985v1

Compressor summary: The paper presents a new representation for diffusion models that enables image generation, segmentation, and adaptation to various tasks with efficient modules and inpainting pipeline.


Semantics-aware Motion Retargeting with Vision-Language Models

Haodong Zhang,ZhiKe Chen,Haocheng Xu,Lei Hao,Xiaofei Wu,Songcen Xu,Zhensong Zhang,Yue Wang,Rong Xiong

http://arxiv.org/abs/2312.01964v1

Compressor summary: The SMT method uses vision-language models to extract and maintain meaningful motion semantics for motion retargeting between animation characters, with a two-stage pipeline that ensures preservation of both fine-grained details and high-level semantics.


Learning-Based Approaches to Predictive Monitoring with Conformal Statistical Guarantees

Francesca Cairoli,Luca Bortolussi,Nicola Paoletti

http://arxiv.org/abs/2312.01959v1

Compressor summary: This tutorial explains how to use machine learning and conformal prediction to efficiently predict and monitor future violations of requirements in complex systems.
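
The conformal-prediction ingredient can be sketched with standard split conformal regression. This is the generic textbook construction, not the tutorial's specific monitoring method; the toy system and names are assumptions:

```python
import math
import random

def conformal_half_width(calibration_residuals, alpha):
    """Split conformal prediction: given absolute residuals of a fitted
    predictor on a held-out calibration set, return the half-width q such
    that [prediction - q, prediction + q] covers the true value with
    probability >= 1 - alpha (under exchangeability)."""
    n = len(calibration_residuals)
    # conformal quantile: the ceil((n+1)(1-alpha))-th smallest residual
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(calibration_residuals)[min(rank, n) - 1]

def predict(x):
    return 2.0 * x + 0.1  # slightly biased point predictor for y = 2x

rng = random.Random(42)
calibration = [(x, 2.0 * x + rng.gauss(0, 0.2)) for x in range(100)]
residuals = [abs(y - predict(x)) for x, y in calibration]

q = conformal_half_width(residuals, alpha=0.1)
x_new = 7.0
interval = (predict(x_new) - q, predict(x_new) + q)
```

A runtime monitor can then flag a requirement violation whenever the entire interval, rather than just the point prediction, crosses a safety threshold, which is what gives the guarantee its statistical validity.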


Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective

Victor Gallego

http://arxiv.org/abs/2312.01957v1

Compressor summary: The paper presents dSC, a method for improving LLM outputs using synthetic data and Bayesian inference, and shows its potential in various tasks.


Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models

Andrea Papaluca,Daniel Krefl,Sergio Mendez Rodriguez,Artem Lensky,Hanna Suominen

http://arxiv.org/abs/2312.01954v1

Compressor summary: The authors evaluate how Large Language Models (LLMs) use contextual information from a Knowledge Base (KB) to perform Triplet Extraction (TE) in Zero- and Few-Shots settings, finding that the quality of the KB context strongly affects TE performance.


Instance-guided Cartoon Editing with a Large-scale Dataset

Jian Lin,Chengze Li,Xueting Liu,Zhongping Ge

http://arxiv.org/abs/2312.01943v1

Compressor summary: The authors introduce a new dataset and model for accurately segmenting characters in cartoons, which enables various creative applications in cartoon editing.


Foundations for Transfer in Reinforcement Learning: A Taxonomy of Knowledge Modalities

Markus Wulfmeier,Arunkumar Byravan,Sarah Bechtle,Karol Hausman,Nicolas Heess

http://arxiv.org/abs/2312.01939v1

Compressor summary: The paper discusses how artificial intelligence systems are becoming more general and examines the challenges and opportunities in improving knowledge representation and transfer across domains, organized as a taxonomy of knowledge modalities for reinforcement learning.


A Reliable Representation with Bidirectional Transition Model for Visual Reinforcement Learning Generalization

Xiaobo Hu,Youfang Lin,Yue Liu,Jinwen Wang,Shuo Wang,Hehe Fan,Kai Lv

http://arxiv.org/abs/2312.01915v1

Compressor summary: The BiT model leverages bidirectional prediction of environmental transitions to extract reliable representations for vision-based control tasks.


Unsupervised Anomaly Detection using Aggregated Normative Diffusion

Alexander Frotscher,Jaivardhan Kapoor,Thomas Wolfers,Christian F. Baumgartner

http://arxiv.org/abs/2312.01904v1

Compressor summary: ANDi is a new unsupervised anomaly detection method for brain MRI that outperforms existing approaches and can identify diverse types of anomalies better.


Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang,Huan Gao,Ping Guo,Limin Wang

http://arxiv.org/abs/2312.01897v1

Compressor summary: The paper proposes a new mechanism for adapting pre-trained ViT models as a unified long-form video transformer to capture inter-snippet relations for temporal action detection in untrimmed videos, while maintaining low computation overhead and memory consumption.


Non-Intrusive Load Monitoring for Feeder-Level EV Charging Detection: Sliding Window-based Approaches to Offline and Online Detection

Cameron Martin,Fucai Ke,Hao Wang

http://arxiv.org/abs/2312.01887v1

Compressor summary: This paper presents a novel method for detecting electric vehicle (EV) charging at the feeder level using sliding-window feature extraction and machine learning techniques, achieving high accuracy in both offline and online detection.


InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

Xunguang Wang,Zhenlan Ji,Pingchuan Ma,Zongjie Li,Shuai Wang

http://arxiv.org/abs/2312.01886v1

Compressor summary: The paper proposes InstructTA, a targeted adversarial attack on large vision-language models that uses a text-to-image model, GPT-4, and a local surrogate model to generate instruction-aware features and optimize the adversarial example.


Correlation and Unintended Biases on Univariate and Multivariate Decision Trees

Mattia Setzu,Salvatore Ruggieri

http://arxiv.org/abs/2312.01884v1

Compressor summary: The paper shows that univariate and multivariate decision trees, which partition data differently, achieve similar performance despite the latter being more expressive, possibly due to bias introduced by dataset pre-processing.


Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario

Yimin Sun,Chao Wang,Yan Peng

http://arxiv.org/abs/2312.01882v1

Compressor summary: The paper introduces ZFDDA, a zero-shot VQA model for flood damage assessment, and FFD-IQA, a new dataset with diverse question types and more data to evaluate the model's performance.


HGPROMPT: Bridging Homogeneous and Heterogeneous Graphs for Few-shot Prompt Learning

Xingtong Yu,Zemin Liu,Yuan Fang,Xinming Zhang

http://arxiv.org/abs/2312.01878v1

Compressor summary: HGPROMPT is a novel framework that unifies pre-training and downstream tasks for homogeneous and heterogeneous graphs using dual-template design and dual-prompt to bridge the gap between them.


FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual Explanations

Yitao Peng,Lianghua He,Die Hu,Yihang Liu,Longzhen Yang,Shaohua Shang

http://arxiv.org/abs/2312.01871v1

Compressor summary: The paper proposes FeaInfNet, a model for interpretable medical image diagnosis that simulates doctors' reasoning process, uses local feature masks and adaptive dynamic masks to enhance expressivity and interpretability, and achieves state-of-the-art performance on multiple datasets.


Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness

Zichao Li,Ines Arous,Siva Reddy,Jackie C. K. Cheung

http://arxiv.org/abs/2312.01858v1

Compressor summary: The paper proposes DepEdit, an evaluation protocol for assessing how fact editing in large language models (LLMs) respects logical constraints and the implications of edited facts.


Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation

Joshua Niemeijer,Manuel Schwonberg,Jan-Aike Termöhlen,Nico M. Schmidt,Tim Fingscheidt

http://arxiv.org/abs/2312.01850v1

Compressor summary: The authors propose DIDEX, a diffusion-based domain extension method that generates diverse pseudo-target images with text prompts and trains a model to adapt towards them, achieving improved domain generalization results on various datasets without using target data.


VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Xusen Sun,Longhao Zhang,Hao Zhu,Peng Zhang,Bang Zhang,Xinya Ji,Kangneng Zhou,Daiheng Gao,Liefeng Bo,Xun Cao

http://arxiv.org/abs/2312.01841v1

Compressor summary: VividTalk is a framework that generates high-quality talking head videos with lip-sync, expressive facial expressions, natural head pose, and high video quality by learning two motions in two stages.


Prompting Disentangled Embeddings for Knowledge Graph Completion with Pre-trained Language Model

Yuxia Geng,Jiaoyan Chen,Yuhang Zeng,Zhuo Chen,Wen Zhang,Jeff Z. Pan,Yuxiang Wang,Xiaoliang Xu

http://arxiv.org/abs/2312.01837v1

Compressor summary: The paper proposes a new Knowledge Graph Completion method (PDKGC) that uses prompts to train a frozen pre-trained language model, improving entity prediction by combining textual and structural information.


Few Clicks Suffice: Active Test-Time Adaptation for Semantic Segmentation

Longhui Yuan,Shuang Li,Zhuo He,Binhui Xie

http://arxiv.org/abs/2312.01835v1

Compressor summary: The paper proposes active test-time adaptation (ATASeg) for semantic segmentation, which uses a human-in-the-loop pattern to query few labels online and reduce the performance gap between unsupervised and supervised methods.


Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication

Zhangyue Yin,Qiushi Sun,Cheng Chang,Qipeng Guo,Junqi Dai,Xuanjing Huang,Xipeng Qiu

http://arxiv.org/abs/2312.01823v1

Compressor summary: The Exchange-of-Thought framework allows large language models to communicate with each other during problem-solving, improving their performance on complex reasoning tasks by incorporating external insights.


Learning Machine Morality through Experience and Interaction

Elizaveta Tennant,Stephen Hailes,Mirco Musolesi

http://arxiv.org/abs/2312.01818v1

Compressor summary: The paper explores different approaches to embedding morality in AI systems, argues that hybrid solutions combining hard-coded rules and learned preferences are needed, and presents three case studies using reinforcement learning to provide moral principles to agents.


Class Symbolic Regression: Gotta Fit 'Em All

Wassim Tenachi,Rodrigo Ibata,Thibaut L. François,Foivos I. Diakogiannis

http://arxiv.org/abs/2312.01816v1

Compressor summary: Class Symbolic Regression is a framework that finds a single function to fit multiple data sets with different parameters, using the idea that similar phenomena follow common laws. It improves on previous symbolic regression methods by integrating dimensional analysis and deep reinforcement learning, and shows its usefulness in astrophysics by finding an analytic galaxy potential from simulated orbits.


Energy-based Potential Games for Joint Motion Forecasting and Control

Christopher Diehl,Tobias Klosek,Martin Krüger,Nils Murzyn,Timo Osterburg,Torsten Bertram

http://arxiv.org/abs/2312.01811v1

Compressor summary: The authors propose a game-theoretic approach to model multi-agent interactions in robotics, combining energy-based models with neural networks for inference and optimization, which improves interpretability and predictive performance.


Collaborative Neural Painting

Nicola Dall'Asen,Willi Menapace,Elia Peruzzo,Enver Sangineto,Yiming Wang,Elisa Ricci

http://arxiv.org/abs/2312.01800v1

Compressor summary: The text describes a novel AI-based collaborative painting task that aims to produce coherent paintings with humans and machines, using parametrized strokes, attention mechanisms, and a new dataset.


Distributed Continual Learning with CoCoA in High-dimensional Linear Regression

Martin Hellkvist,Ayça Özçelikkale,Anders Ahlén

http://arxiv.org/abs/2312.01795v1

Compressor summary: The paper analyzes the generalization error of CoCoA, a distributed learning algorithm, for continual learning with time-varying signals in high-dimensional linear regression, shows how network size, task similarity, and the number of tasks affect that error, and demonstrates the results on a digit classification task.


Wild-Tab: A Benchmark For Out-Of-Distribution Generalization In Tabular Regression

Sergey Kolesnikov

http://arxiv.org/abs/2312.01792v1

Compressor summary: Wild-Tab is a benchmark for testing out-of-distribution generalization in tabular regression tasks, using real-world datasets from weather prediction and power consumption estimation.


Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization

Konstantinos Triaridis,Vasileios Mezaris

http://arxiv.org/abs/2312.01790v1

Compressor summary: The paper presents two methods for merging the outputs of different filters to improve image manipulation localization and detection (IMLD), achieving competitive results compared to existing approaches.


Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world

Chengyin Hu,Weiwen Shi

http://arxiv.org/abs/2312.01789v1

Compressor summary: This paper introduces a novel attack method, TOUAP, for compromising cross-modal visible-infrared detectors in real-world scenarios using a two-stage optimization process involving an irregular polygonal infrared patch and a color QR code.


Developing Linguistic Patterns to Mitigate Inherent Human Bias in Offensive Language Detection

Toygar Tanyel,Besher Alkurdi,Serkan Ayvaz

http://arxiv.org/abs/2312.01787v1

Compressor summary: The paper proposes a data augmentation method to reduce human bias in offensive language detection on social media, aiming to improve accuracy and fairness in classifying offensive content across multiple languages.


IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

Jiarui Xu,Yossi Gandelsman,Amir Bar,Jianwei Yang,Jianfeng Gao,Trevor Darrell,Xiaolong Wang

http://arxiv.org/abs/2312.01771v1

Compressor summary: The paper introduces IMProv, a generative model that learns visual tasks from textual and image prompts, achieving improvements in various computer vision tasks.


Localizing and Assessing Node Significance in Default Mode Network using Sub-Community Detection in Mild Cognitive Impairment

Ameiy Acharya,Chakka Sai Pradeep,Neelam Sinha

http://arxiv.org/abs/2312.01768v1

Compressor summary: The study uses fMRI and NSS to identify brain regions in the DMN that are most impacted by MCI, finding significant differences for PCC and Fusiform nodes.


Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection

Chen Zhang,Guorong Li,Yuankai Qi,Hanhua Ye,Laiyun Qing,Ming-Hsuan Yang,Qingming Huang

http://arxiv.org/abs/2312.01764v1

Compressor summary: The paper proposes a Dynamic Erasing Network (DE-Net) that learns multi-scale temporal features for weakly supervised video anomaly detection, handling duration variations and encouraging the discovery of gentle abnormal segments.


CZL-CIAE: CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation

Yuntao Shou,Wei Ai,Tao Meng,Keqin Li

http://arxiv.org/abs/2312.01758v1

Compressor summary: The paper proposes a novel method called CZL-CIAE that leverages CLIP and FourierFormer to improve zero-shot age estimation from images and text, leading to better prediction results.


A Comprehensive Literature Review on Sweet Orange Leaf Diseases

Yousuf Rayhan Emon,Md Golam Rabbani,Dr. Md. Taimur Ahad,Faruk Ahmed

http://arxiv.org/abs/2312.01756v1

Compressor summary: The paragraph discusses a literature review of machine learning methods for detecting sweet orange leaf diseases using image classification techniques.


Long-Tail Learning with Rebalanced Contrastive Loss

Charika De Alvis,Dishanika Denipitiyage,Suranga Seneviratne

http://arxiv.org/abs/2312.01753v1

Compressor summary: Rebalanced Contrastive Learning (RCL) improves long tail classification by balancing feature space, reducing intra-class distance, and regularizing margins for imbalanced classes.
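
The paper's exact RCL formulation isn't reproduced in this summary, but the underlying idea of counteracting class imbalance by weighting samples inversely to class frequency can be sketched generically (the function name is illustrative, not from the paper):

```python
from collections import Counter

def rebalanced_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    normalized so they sum to the number of samples. Rare classes get
    larger weights, which is the basic rebalancing idea behind
    long-tail contrastive losses."""
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    scale = len(labels) / sum(raw)
    return [w * scale for w in raw]

# 8 "cat" samples vs. 2 "dog" samples: each dog sample is upweighted.
weights = rebalanced_weights(["cat"] * 8 + ["dog"] * 2)
```

Plugging such weights into a supervised contrastive objective shifts the margin budget toward minority classes, which is the spirit (though not the letter) of the method described above.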


Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation

Qiaole Dong,Bo Zhao,Yanwei Fu

http://arxiv.org/abs/2312.01746v1

Compressor summary: The authors reproduce the closed-source DDVM model for image-to-image translation, making it open-source, and achieve comparable performance to the original with public data and GPUs.


Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval

Dixuan Lin,Yixing Peng,Jingke Meng,Wei-Shi Zheng

http://arxiv.org/abs/2312.01745v1

Compressor summary: The paper proposes a new method for person re-identification that builds fine bidirectional cross-modal associations between visual and textual modalities using adaptive dual association modules, ATP and ARA.


Fully Spiking Denoising Diffusion Implicit Models

Ryo Watanabe,Yusuke Mukuta,Tatsuya Harada

http://arxiv.org/abs/2312.01742v1

Compressor summary: The paper proposes a novel approach, FSDDIM, to create a diffusion model within spiking neural networks (SNNs) using synaptic current learning (SCL), which enables high-speed and low-energy image generation while maintaining the advantages of SNNs.


SRSNetwork: Siamese Reconstruction-Segmentation Networks based on Dynamic-Parameter Convolution

Bingkun Nian,Fenghe Tang,Jianrui Ding,Pingping Zhang,Jie Yang,S. Kevin Zhou,Wei Liu

http://arxiv.org/abs/2312.01741v1

Compressor summary: The paper introduces a new deep neural network for weak target image segmentation that leverages reconstruction tasks and outperforms existing methods on seven datasets.


Divide-and-Conquer Strategy for Large-Scale Dynamic Bayesian Network Structure Learning

Hui Ouyang,Cheng Chen,Ke Tang

http://arxiv.org/abs/2312.01739v1

Compressor summary: This paper introduces a novel divide-and-conquer strategy for large-scale Dynamic Bayesian Network structure learning, specifically focusing on Time-sliced Bayesian Networks, and shows substantial improvements in scalability, accuracy, and computational efficiency.


Effective Adapter for Face Recognition in the Wild

Yunhao Liu,Lu Qi,Yu-Ju Tsai,Xiangtai Li,Kelvin C. K. Chan,Ming-Hsuan Yang

http://arxiv.org/abs/2312.01734v1

Compressor summary: The paper proposes an adapter for face recognition models that processes both low-quality and enhanced images using dual-input structures to overcome the limitations of traditional approaches and achieve better performance in the wild.


Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection

Fan Lu,Kai Zhu,Kecheng Zheng,Wei Zhai,Yang Cao

http://arxiv.org/abs/2312.01732v1

Compressor summary: The paper proposes a new method for detecting out-of-distribution samples in images and text that uses semantic alignment and likelihood-aware sampling to adapt to complex domain transformations.


EdgeConvFormer: Dynamic Graph CNN and Transformer based Anomaly Detection in Multivariate Time Series

Jie Liu,Qilin Li,Senjian An,Bradley Ezard,Ling Li

http://arxiv.org/abs/2312.01729v1

Compressor summary: EdgeConvFormer is a novel anomaly detection method for multivariate time series that combines Time2vec embedding, dynamic graph CNN, and Transformer to extract global and local spatial-time information and outperforms existing approaches on various real-world datasets.


ImputeFormer: Graph Transformers for Generalizable Spatiotemporal Imputation

Tong Nie,Guoyang Qin,Yuewen Mei,Jian Sun

http://arxiv.org/abs/2312.01728v1

Compressor summary: The paper proposes an effective and versatile deep neural model for multivariate time series imputation, which incorporates low-rank properties and achieves superior performance on various datasets.


StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim,Gyojung Gu,Minho Park,Sunghyun Park,Jaegul Choo

http://arxiv.org/abs/2312.01725v1

Compressor summary: StableVITON is an image-based virtual try-on method that learns the semantic correspondence between clothing and body in the latent space of a pre-trained diffusion model, using zero cross-attention blocks, an attention total variation loss, and augmentation to preserve clothing details and produce high-fidelity images with sharp attention maps.


The Self-Loop Paradox: Investigating the Impact of Self-Loops on Graph Neural Networks

Moritz Lampert,Ingo Scholtes

http://arxiv.org/abs/2312.01721v1

Compressor summary: The self-loop paradox is a phenomenon where the information a node gains from itself can be smaller in graphs with self-loops compared to graphs without, depending on the GNN architecture, number of layers, and whether the layer number is even or odd.
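
The even/odd-layer dependence described above can be reproduced on a toy graph by counting closed walks, a crude stand-in for the self-information that k rounds of message passing mix back into a node. This is an illustrative sketch, not the paper's analysis:

```python
import numpy as np

# Toy graph: a 4-cycle (bipartite, so walks alternate between the two sides).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

def self_walk_fraction(adj, k):
    """Fraction of length-k walks ending at node 0 that also started there --
    a proxy for how much of node 0's layer-k representation is 'about itself'."""
    walks = np.linalg.matrix_power(adj, k)
    return walks[0, 0] / walks[:, 0].sum()

without_loops = [self_walk_fraction(A, k) for k in (1, 2, 3)]
with_loops = [self_walk_fraction(A + np.eye(4), k) for k in (1, 2, 3)]
```

Without self-loops the fraction oscillates with parity (zero at odd depths, large at even depths), while adding self-loops smooths it out and can make it smaller at even depths, mirroring the paradox the paper investigates.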


Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

Bingshuai Liu,Chenyang Lyu,Zijun Min,Zhanyu Wang,Jinsong Su,Longyue Wang

http://arxiv.org/abs/2312.01714v1

Compressor summary: The paper proposes a method to improve multi-modal reasoning in LLMs by using retrieval mechanisms to select relevant examples, achieving state-of-the-art results on ScienceQA dataset.


Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection

Xubin Zhong,Changxing Ding,Yupeng Hu,Dacheng Tao

http://arxiv.org/abs/2312.01713v1

Compressor summary: The paper proposes Shunted Cross-Attention (SCA) and Interaction-aware Pose Estimation (IPE) to improve one-stage HOI detection by extracting disentangled interaction representations using different attention heads and a novel attention module for human pose features.


Regressor-Segmenter Mutual Prompt Learning for Crowd Counting

Mingyue Guo,Li Yuan,Zhaoyi Yan,Binghui Chen,Yaowei Wang,Qixiang Ye

http://arxiv.org/abs/2312.01711v1

Compressor summary: mPrompt is a method that uses both point and segmentation annotations to guide each other, reducing bias and improving accuracy in crowd counting tasks.


Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

Lei Wang,Jiabang He,Shenshen Li,Ning Liu,Ee-Peng Lim

http://arxiv.org/abs/2312.01701v1

Compressor summary: The paper introduces ReCaption, a framework to reduce fine-grained object hallucinations in instruction-tuned large vision-language models using ChatGPT and a new probing-based evaluation method called Fine-Grained Object Hallucination Evaluation.


Data Management For Large Language Models: A Survey

Zige Wang,Wanjun Zhong,Yufei Wang,Qi Zhu,Fei Mi,Baojun Wang,Lifeng Shang,Xin Jiang,Qun Liu

http://arxiv.org/abs/2312.01700v1

Compressor summary: The paragraph discusses the importance of data management in training Large Language Models, and provides a survey of current research and challenges in this field.


Rethinking Urban Mobility Prediction: A Super-Multivariate Time Series Forecasting Approach

Jinguo Cheng,Ke Li,Yuxuan Liang,Lijun Sun,Junchi Yan,Yuankai Wu

http://arxiv.org/abs/2312.01699v1

Compressor summary: SUMformer is a novel approach to urban mobility prediction that treats city data as complex multivariate time series and uses a special attention mechanism to capture temporal and cross-variable correlations, achieving better results than existing methods.


Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Yizhou Wang,Yixuan Wu,Shixiang Tang,Weizhen He,Xun Guo,Feng Zhu,Lei Bai,Rui Zhao,Jian Wu,Tong He,Wanli Ouyang

http://arxiv.org/abs/2312.01697v1

Compressor summary: Hulk is a multimodal human-centric generalist model that can handle various perception tasks without fine-tuning, by using discrete and continuous representations for different modalities.


BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

Zhenxin Li,Shiyi Lan,Jose M. Alvarez,Zuxuan Wu

http://arxiv.org/abs/2312.01696v1

Compressor summary: The paper proposes BEVNeXt, a dense Bird's Eye View framework for 3D object detection that combines depth estimation, temporal aggregation, and perspective techniques with CRF-modulated depth embedding, achieving state-of-the-art results on the nuScenes benchmark.


Risk-Controlling Model Selection via Guided Bayesian Optimization

Bracha Laufer-Goldshtein,Adam Fisch,Regina Barzilay,Tommi Jaakkola

http://arxiv.org/abs/2312.01692v1

Compressor summary: The paper proposes a method to find machine learning model configurations that balance various risks and metrics using Bayesian Optimization and risk-controlling procedures.


Optimizing Bus Travel: A Novel Approach to Feature Mining with P-KMEANS and P-LDA Algorithms

Hongjie Liu,Haotian Shi,Sicheng Fu,Tengfei Yuan,Xinhuan Zhang,Hongzhe Xu,Bin Ran

http://arxiv.org/abs/2312.01687v1

Compressor summary: The study presents a method for mining bus travel features from Point of Interest (POI) data using enhanced P-KMEANS and P-LDA algorithms, which can help improve bus travel attractiveness and usage and reduce congestion and emissions by better understanding travel behavior.


ResEnsemble-DDPM: Residual Denoising Diffusion Probabilistic Models for Ensemble Learning

Shi Zhenning,Dong Changsheng,Xie Xueshuo,Pan Bin,He Along,Li Tao

http://arxiv.org/abs/2312.01682v1

Compressor summary: ResEnsemble-DDPM is a method that combines denoising diffusion probabilistic models and end-to-end models for better image segmentation by introducing a residual term and using ensemble learning.


Jellyfish: A Large Language Model for Data Preprocessing

Haochen Zhang,Yuyang Dong,Chuan Xiao,Masafumi Oyamada

http://arxiv.org/abs/2312.01678v1

Compressor summary: The paper introduces Jellyfish, an open-source LLM for data preprocessing (DP) tasks that can run on a low-priced GPU, learn domain knowledge during tuning, and explain its output decisions with an interpreter.


Multi-task Image Restoration Guided By Robust DINO Features

Xin Lin,Chao Ren,Kelvin C. K. Chan,Lu Qi,Jinshan Pan,Ming-Hsuan Yang

http://arxiv.org/abs/2312.01677v1

Compressor summary: DINO-IR is a novel multi-task image restoration approach that uses robust features from DINOv2 to achieve better performance than existing methods in various tasks.


EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research

Jingyu Pan,Chen-Chia Chang,Zhiyao Xie,Yiran Chen

http://arxiv.org/abs/2312.01674v1

Compressor summary: The paper introduces EDALearn, a benchmark suite for Machine Learning in Electronic Design Automation, which provides a comprehensive and open-source dataset with end-to-end data collection, analysis, and reproducibility to promote research and efficiency in VLSI design.


STADEE: STAtistics-based DEEp Detection of Machine Generated Text

Zheng Chen,Huming Liu

http://arxiv.org/abs/2312.01672v1

Compressor summary: STADEE is a novel deep detection method that combines statistics with deep learning to identify machine-generated text, outperforming existing methods in various scenarios.


Multimodality-guided Image Style Transfer using Cross-modal GAN Inversion

Hanyu Wang,Pengxiang Wu,Kevin Dela Rosa,Chen Wang,Abhinav Shrivastava

http://arxiv.org/abs/2312.01671v1

Compressor summary: The text introduces a novel cross-modal GAN inversion method for MultiModality-guided Image Style Transfer (MMIST) that improves style transfer under text guidance and accepts style inputs from various sources, achieving state-of-the-art performance on the TIST task and effectiveness on the MMIST task.


Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training

Runze He,Shaofei Huang,Xuecheng Nie,Tianrui Hui,Luoqi Liu,Jiao Dai,Jizhong Han,Guanbin Li,Si Liu

http://arxiv.org/abs/2312.01663v1

Compressor summary: The paper introduces a CustomNeRF model that can edit 3D scenes based on texts or images, addressing challenges like foreground editing and multi-view consistency.


ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions

Phuoc Pham Van Long,Duc Anh Vu,Nhat M. Hoang,Xuan Long Do,Anh Tuan Luu

http://arxiv.org/abs/2312.01661v1

Compressor summary: The paragraph discusses using ChatGPT, a large language model, to generate mathematical questions for different levels of education and evaluates its performance in both context-aware and context-unaware settings.


RiskBench: A Scenario-based Benchmark for Risk Identification

Chi-Hsi Kung,Chieh-Chi Yang,Pang-Yuan Pao,Shu-Wei Lu,Pin-Lun Chen,Hsin-Cheng Lu,Yi-Ting Chen

http://arxiv.org/abs/2312.01659v1

Compressor summary: The paper introduces RiskBench, a benchmark for evaluating risk identification algorithms in intelligent driving systems that aim to achieve zero collisions.


AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix

Yun Yue,Zhiling Ye,Jiadi Jiang,Yongchao Liu,Ke Zhang

http://arxiv.org/abs/2312.01658v1

Compressor summary: The paper proposes AGD, a new optimizer for deep learning that uses a novel preconditioning matrix and an auto-switching function to improve generalization performance on various datasets.


On Tuning Neural ODE for Stability, Consistency and Faster Convergence

Sheikh Waqas Akhtar

http://arxiv.org/abs/2312.01657v1

Compressor summary: Neural ODEs use continuous-depth neural networks to solve differential equations, offering advantages in memory efficiency, adaptability, and flexibility, but they have stability issues that can be addressed with a Nesterov's accelerated gradient (NAG) based ODE solver.


An End-to-End Network Pruning Pipeline with Sparsity Enforcement

Evan Dogariu

http://arxiv.org/abs/2312.01653v1

Compressor summary: The text describes a new end-to-end training pipeline for neural network sparsification that reduces model size, complexity, and memory footprint while maintaining competitive performance.


Adaptive Confidence Threshold for ByteTrack in Multi-Object Tracking

Linh Van Ma,Muhammad Ishfaq Hussain,JongHyun Park,Jeongbae Kim,Moongu Jeon

http://arxiv.org/abs/2312.01650v1

Compressor summary: The paper presents an improved version of ByteTrack, a multiple object tracking algorithm, that adapts its confidence threshold based on detection performance.


Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

Randall Balestriero,Romain Cosentino,Sarath Shekkizhar

http://arxiv.org/abs/2312.01648v1

Compressor summary: The authors propose a geometric approach to understand and manipulate large language models, revealing new features that help solve various tasks without relying on approximations or fine-tuning.


Voice-Based Smart Assistant System for Vehicles using RASA

Aditya Paranjape,Yash Patwardhan,Vedant Deshpande,Aniket Darp,Jayashree Jagdale

http://arxiv.org/abs/2312.01642v1

Compressor summary: The paper presents a voice-based chatbot for cars to improve road safety by automating tasks such as navigation, calls, weather forecasts, and music using voice commands instead of manual actions.


SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm

Jiandong Jin,Xiao Wang,Chenglong Li,Lili Huang,Jin Tang

http://arxiv.org/abs/2312.01640v1

Compressor summary: The paper proposes a new generative model called SequencePAR for pedestrian attribute recognition that uses visual features, text prompts, and attention mechanisms to improve performance on complex and imbalanced data.


Robust Streaming, Sampling, and a Perspective on Online Learning

Evan Dogariu,Jiatong Yu

http://arxiv.org/abs/2312.01634v1

Compressor summary: The paper introduces statistical learning, robust streaming techniques, and their connections, aiming to enlighten and inspire further research in both fields.


GaussianHead: Impressive 3D Gaussian-based Head Avatars with Dynamic Hybrid Neural Field

Jie Wang,Xianyan Li,Jiucheng Xie,Feng Xu,Hao Gao

http://arxiv.org/abs/2312.01632v1

Compressor summary: GaussianHead is a head avatar algorithm that uses 3D gaussian primitives to represent dynamic scenes efficiently and accurately, achieving optimal visual results in various tasks.


CLAMP: Contrastive LAnguage Model Prompt-tuning

Piotr Teterwak,Ximeng Sun,Bryan A. Plummer,Kate Saenko,Ser-Nam Lim

http://arxiv.org/abs/2312.01629v1

Compressor summary: The paper explores adapting large language models (LLMs) for image classification using contrastive learning, improving their performance and retaining their generative abilities.


GVFs in the Real World: Making Predictions Online for Water Treatment

Muhammad Kamran Janjua,Haseeb Shah,Martha White,Erfan Miahi,Marlos C. Machado,Adam White

http://arxiv.org/abs/2312.01624v1

Compressor summary: The paper explores using reinforcement learning for predicting water treatment plant operations and shows that online learning improves prediction accuracy.


Universal Segmentation at Arbitrary Granularity with Language Instruction

Yong Liu,Cairong Zhang,Yitong Wang,Jiahao Wang,Yujiu Yang,Yansong Tang

http://arxiv.org/abs/2312.01623v1

Compressor summary: The paper introduces UniLSeg, a universal segmentation model that can perform segmentation at any semantic level using language instructions and a unified data format.


How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking

Zhengyu Hu,Jieyu Zhang,Yue Yu,Yuchen Zhuang,Hui Xiong

http://arxiv.org/abs/2312.01619v1

Compressor summary: LEMR is a framework that reduces annotation costs for model selection tasks by using ensemble methods, uncertainty sampling, and Z-score refinement, achieving comparable results to fully labeled datasets with less labeling effort.
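
LEMR's exact pipeline isn't given in this summary, but two of its named ingredients, uncertainty sampling and Z-score normalization, are standard techniques and can be sketched generically (function names are hypothetical):

```python
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(prob_rows, budget):
    """Classic uncertainty sampling: spend the labeling budget on the
    examples whose predicted class distribution has the highest entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]), reverse=True)
    return ranked[:budget]

def zscores(values):
    """Standardize metric estimates so candidate models can be ranked
    on a common scale."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0
    return [(v - mean) / std for v in values]

# With a budget of 1, the 50/50 prediction is the one worth labeling.
picked = select_most_uncertain([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]], budget=1)
```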


SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System

Yunfei Fan,Tianyu Zhao,Guidong Wang

http://arxiv.org/abs/2312.01616v1

Compressor summary: The SchurVINS framework combines high accuracy and low computational complexity for visual inertial navigation systems by using a complete residual model and Schur complement.


xNeuSM: Explainable Neural Subgraph Matching with Graph Learnable Multi-hop Attention Networks

Duc Q. Nguyen,Thanh Toan Nguyen,Tho Quan

http://arxiv.org/abs/2312.01612v1

Compressor summary: The article introduces xNeuSM, an explainable neural subgraph matching method that adapts attention factors for each node and improves prediction accuracy and query time compared to existing methods.


Deep Learning-Driven Enhancement of Welding Quality Control: Predicting Welding Depth and Pore Volume in Hairpin Welding

Amena Darwish,Stefan Ericson,Rohollah Ghasemi,Tobias Andersson,Dan Lönn,Andreas Andersson Lassila,Kent Salomonsson

http://arxiv.org/abs/2312.01606v1

Compressor summary: The study proposes a deep learning model that predicts two critical weld characteristics using various laser welding input factors and shows promising results for improving welding quality assurance.


TextAug: Test time Text Augmentation for Multimodal Person Re-identification

Mulham Fawakherji,Eduard Vazquez,Pasquale Giampa,Binod Bhattarai

http://arxiv.org/abs/2312.01605v1

Compressor summary: The paragraph discusses a new text augmentation technique called CutMixOut, which combines cutout and cutmix methods to improve multimodal person re-identification performance.
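
The paper's precise CutMixOut procedure isn't specified in the summary; under the assumption that both operations act on caption tokens, generic token-level analogues of cutout and cutmix might look like this (names and defaults are illustrative):

```python
import random

def text_cutout(tokens, drop_prob=0.2, rng=None):
    """Randomly drop tokens -- a text analogue of image cutout."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept or tokens  # never return an empty caption

def text_cutmix(tokens_a, tokens_b, rng=None):
    """Splice a prefix of one caption with the suffix of another --
    a text analogue of image cutmix."""
    rng = rng or random.Random(0)
    cut = rng.randint(1, min(len(tokens_a), len(tokens_b)) - 1)
    return tokens_a[:cut] + tokens_b[cut:]

mixed = text_cutmix("a man in red shirt".split(),
                    "a woman with blue bag".split())
```

Applied at test time, such augmentations produce multiple textual views of the same identity whose features can be averaged, which is the general motivation behind test-time text augmentation.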


Local-Global History-aware Contrastive Learning for Temporal Knowledge Graph Reasoning

Wei Chen,Huaiyu Wan,Yuting Wu,Shuyuan Zhao,Jiayaqi Cheng,Yuxin Li,Youfang Lin

http://arxiv.org/abs/2312.01601v1

Compressor summary: The paper proposes a new method, LogCL, for predicting future facts in temporal knowledge graphs by using contrastive learning to fuse local and global historical information and improving robustness against noise.


Good Questions Help Zero-Shot Image Reasoning

Kaiwen Yang,Tao Shen,Xinmei Tian,Xiubo Geng,Chongyang Tao,Dacheng Tao,Tianyi Zhou

http://arxiv.org/abs/2312.01598v1

Compressor summary: The paragraph introduces QVix, a new prompting strategy for large vision-language models to improve their zero-shot image reasoning capabilities by asking more detailed questions about the input images.


SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Feng Wang,Jieru Mei,Alan Yuille

http://arxiv.org/abs/2312.01597v1

Compressor summary: The paper proposes a novel self-attention mechanism called Correlative Self-Attention (CSA) that adapts CLIP for zero-shot semantic segmentation, achieving significant improvements over existing methods.


Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment

Cong-Duy Nguyen,The-Anh Vu-Le,Thong Nguyen,Tho Quan,Luu Anh Tuan

http://arxiv.org/abs/2312.01592v1

Compressor summary: The paper introduces GroundedBERT, a method that combines BERT with visual information using Optimal Transport to improve language learning for grounded tasks.


OCGEC: One-class Graph Embedding Classification for DNN Backdoor Detection

Haoyu Jiang,Haiyang Yu,Nan Li,Ping Yi

http://arxiv.org/abs/2312.01585v1

Compressor summary: The study proposes OCGEC, a novel one-class classification framework using graph neural networks for model-level backdoor detection in DNNs with minimal clean data and achieves high AUC scores.


Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences

Eleftheria Briakou,Navita Goyal,Marine Carpuat

http://arxiv.org/abs/2312.01582v1

Compressor summary: The authors propose a technique to generate contrastive highlights for explaining predictions of semantic divergence models, which improves on existing saliency methods in capturing fine-grained meaning differences.


Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off

Sachit Kuhar,Yash Jain,Alexey Tumanov

http://arxiv.org/abs/2312.01581v1

Compressor summary: Signed Binarization is a framework that improves accuracy and efficiency of DNNs on edge devices by combining hardware-software systems, quantization functions, and representation learning techniques to balance repetition and sparsity during inference.
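
The repetition-sparsity trade-off named above can be illustrated with a ternary quantizer: zeros contribute sparsity, while the two repeated signed values contribute repetition. This is a generic sketch, not the paper's Signed Binarization scheme:

```python
def ternarize(weights, threshold):
    """Map each weight to {-1, 0, +1}: small weights become zeros (sparsity,
    skippable at inference), the rest collapse to two repeated signed values
    (repetition, cheap to multiply). The threshold trades one off against
    the other."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1)
            for w in weights]
```

Raising the threshold increases sparsity at the cost of quantization error; the paper's contribution, per the summary, is co-designing this trade-off with the hardware-software stack.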


RJHMC-Tree for Exploration of the Bayesian Decision Tree Posterior

Jodie A. Cochrane,Adrian G. Wills,Sarah J. Johnson

http://arxiv.org/abs/2312.01577v1

Compressor summary: The paper proposes a new algorithm for learning Bayesian decision trees using Hamiltonian Monte Carlo (HMC) to improve efficiency and exploration of the posterior.


Learning Efficient Unsupervised Satellite Image-based Building Damage Detection

Yiyun Zhang,Zijian Wang,Yadan Luo,Xin Yu,Zi Huang

http://arxiv.org/abs/2312.01576v1

Compressor summary: The paper proposes U-BDD++, a self-supervised framework for detecting building damage from unlabelled satellite images, using vision-language models to handle domain-specific issues and improve training quality.


A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

Keito Kudo,Haruki Nagasawa,Jun Suzuki,Nobuyuki Shimizu

http://arxiv.org/abs/2312.01575v1

Compressor summary: The paper introduces a new video summarization task that combines keyframe selection and caption generation, creates a dataset to evaluate it, and proposes a practical application for the task.


How to Configure Good In-Context Sequence for Visual Question Answering

Li Li,Jiawei Peng,Huiyi Chen,Chongyang Gao,Xu Yang

http://arxiv.org/abs/2312.01571v1

Compressor summary: The study explores diverse in-context configurations for Large Vision-Language Models using Visual Question Answering and improves their performance, while gaining insights into the inner properties of these models.


Toward Automated Quantum Variational Machine Learning

Omer Subasi

http://arxiv.org/abs/2312.01567v1

Compressor summary: The paper presents MUSE, a search algorithm for quantum variational machine learning that improves accuracy in classification and regression tasks compared to previous methods.


APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

Sanjoy Chowdhury,Sayan Nag,Dinesh Manocha

http://arxiv.org/abs/2312.01564v1

Compressor summary: APoLLo is a method that combines adapter and prompt learning for vision-language models, improving their generalization in few-shot settings by using trainable cross-attention layers and enforcing encoder consistency.


Multi-View Person Matching and 3D Pose Estimation with Arbitrary Uncalibrated Camera Networks

Yan Xu,Kris Kitani

http://arxiv.org/abs/2312.01561v1

Compressor summary: PME is a method for cross-view person matching and 3D human pose estimation in multi-camera networks without requiring 3D data or camera poses, using clustering and geometric constraints to solve the problem.


Hyperspectral Image Compression Using Sampling and Implicit Neural Representations

Shima Rezasoltani,Faisal Z. Qureshi

http://arxiv.org/abs/2312.01558v1

Compressor summary: The paper proposes a hyperspectral image compression method using neural networks that outperforms existing methods at low bitrates and is faster with sampling.
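
The implicit-neural-representation idea behind such methods (store a small function of the coordinates instead of the samples themselves) can be shown in miniature. The paper fits a neural network to pixel coordinates; a Fourier-feature linear model is used here only to keep the sketch dependency-free:

```python
import numpy as np

# 256 signal samples to be "compressed" into 16 model coefficients.
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)

# Fit coordinates -> values with sinusoidal features; the coefficient
# vector is the compressed representation that gets stored.
freqs = np.arange(1, 9)
features = np.concatenate(
    [np.sin(2 * np.pi * freqs[None] * t[:, None]),
     np.cos(2 * np.pi * freqs[None] * t[:, None])], axis=1)
coeffs, *_ = np.linalg.lstsq(features, signal, rcond=None)

# Decoding = evaluating the model at the coordinates.
reconstruction = features @ coeffs
compression_ratio = signal.size / coeffs.size  # 256 values -> 16 coefficients
```

For hyperspectral cubes the same recipe applies per spatial-spectral coordinate, and the paper's sampling step further reduces how many coordinates the model must be fit on.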


Explainable AI is Responsible AI: How Explainability Creates Trustworthy and Socially Responsible Artificial Intelligence

Stephanie Baker,Wei Xiang

http://arxiv.org/abs/2312.01555v1

Compressor summary: The paragraph discusses how explainable AI (XAI) is not only important for transparency but also crucial for ensuring fairness, robustness, privacy, security, and transparency in various applications of responsible AI (RAI).


The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Bill Yuchen Lin,Abhilasha Ravichander,Ximing Lu,Nouha Dziri,Melanie Sclar,Khyathi Chandu,Chandra Bhagavatula,Yejin Choi

http://arxiv.org/abs/2312.01552v1

Compressor summary: The paragraph discusses a study (LIMA) that shows alignment tuning in large language models may be superficial, as base models and aligned versions perform similarly on most tokens, and proposes a new method (URIAL) for tuning-free alignment using in-context learning.


KEEC: Embed to Control on An Equivariant Geometry

Xiaoyuan Cheng,Yiming Yang,Wei Jiang,Yukun Hu

http://arxiv.org/abs/2312.01544v1

Compressor summary: The paper presents KEEC, a method for learning and controlling dynamical systems with complex dynamics using equivariant geometry, which achieves quadratic convergence and outperforms other loss functions.