arxiv compressed, 2023-12-06

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2023-12-06, generated by the compressor, my personal LLM-based project.


ReconFusion: 3D Reconstruction with Diffusion Priors

Rundi Wu,Ben Mildenhall,Philipp Henzler,Keunhong Park,Ruiqi Gao,Daniel Watson,Pratul P. Srinivasan,Dor Verbin,Jonathan T. Barron,Ben Poole,Aleksander Holynski

http://arxiv.org/abs/2312.02981v1

Compressor summary: ReconFusion is a method that uses a diffusion prior to reconstruct 3D scenes from few input images with realistic geometry and texture.


GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Zhangyang Qi,Ye Fang,Zeyi Sun,Xiaoyang Wu,Tong Wu,Jiaqi Wang,Dahua Lin,Hengshuang Zhao

http://arxiv.org/abs/2312.02980v1

Compressor summary: GPT4Point is a new model that improves 3D object understanding and generation using point-text features and Pyramid-XL, a large dataset annotation engine.


Describing Differences in Image Sets with Natural Language

Lisa Dunlap,Yuhui Zhang,Xiaohan Wang,Ruiqi Zhong,Trevor Darrell,Jacob Steinhardt,Joseph E. Gonzalez,Serena Yeung-Levy

http://arxiv.org/abs/2312.02974v1

Compressor summary: The text introduces the task of Set Difference Captioning, automatically describing the differences between two sets of images, and tackles it with a two-stage approach evaluated on a dataset called VisDiffBench.


GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

Shoukang Hu,Ziwei Liu

http://arxiv.org/abs/2312.02973v1

Compressor summary: GauHuman is a fast 3D human model that uses Gaussian Splatting for training and rendering, achieving state-of-the-art performance without compromising quality.


Alchemist: Parametric Control of Material Properties with Diffusion Models

Prafull Sharma,Varun Jampani,Yuanzhen Li,Xuhui Jia,Dmitry Lagun,Fredo Durand,William T. Freeman,Mark Matthews

http://arxiv.org/abs/2312.02970v1

Compressor summary: The proposed method uses text-to-image models and a synthetic dataset with controlled material properties to edit object attributes like roughness, metallic, albedo, and transparency in real images while preserving other features.


Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models

Xinyu Zhang,Sebastian Hofstätter,Patrick Lewis,Raphael Tang,Jimmy Lin

http://arxiv.org/abs/2312.02969v1

Compressor summary: The authors propose a new listwise reranker that does not depend on GPT models and outperforms existing ones in passage retrieval experiments, highlighting the need for better listwise ranking data.


AmbiGen: Generating Ambigrams from Pre-trained Diffusion Model

Boheng Zhao,Rana Hanocka,Raymond A. Yeh

http://arxiv.org/abs/2312.02967v1

Compressor summary: The paper proposes a new method to generate ambigrams using a large-scale vision and language diffusion model that improves legibility and accuracy.


Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

Cheng-Ju Ho,Chen-Hsuan Tai,Yen-Yu Lin,Ming-Hsuan Yang,Yi-Hsuan Tsai

http://arxiv.org/abs/2312.02966v1

Compressor summary: The paper proposes Diffusion-SS3D, a method that uses a diffusion model to improve pseudo-label generation and semi-supervised object detection in 3D scenes by incorporating noise and denoising it.


MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

Zhangyang Xiong,Chenghong Li,Kenkun Liu,Hongjie Liao,Jianqiao Hu,Junyi Zhu,Shuliang Ning,Lingteng Qiu,Chongjie Wang,Shijie Wang,Shuguang Cui,Xiaoguang Han

http://arxiv.org/abs/2312.02963v1

Compressor summary: MVHumanNet is a large-scale 3D human dataset that enables progress in various visual tasks, addressing the lack of high-quality human data for 3D vision research.


Classification for everyone : Building geography agnostic models for fairer recognition

Akshat Jindal,Shreya Singh,Soham Gadgil

http://arxiv.org/abs/2312.02957v1

Compressor summary: The paper explores and proposes solutions for the geographical biases in image classification models using two datasets and various mitigation methods.


LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang,Hongyang Li,Feng Li,Tianhe Ren,Xueyan Zou,Shilong Liu,Shijia Huang,Jianfeng Gao,Lei Zhang,Chunyuan Li,Jianwei Yang

http://arxiv.org/abs/2312.02949v1

Compressor summary: The authors propose a new dataset for grounded visual chat (GVC), a benchmark called Grounding-Bench, and a model that combines segmentation and language models to improve GVC capabilities.


Drag-A-Video: Non-rigid Video Editing with Point-based Interaction

Yao Teng,Enze Xie,Yue Wu,Haoyu Han,Zhenguo Li,Xihui Liu

http://arxiv.org/abs/2312.02936v1

Compressor summary: The paper presents Drag-A-Video, a diffusion-based method for interactive point-based video manipulation that allows users to modify the contents of videos by dragging points and masks across frames.


WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

Jiachen Lu,Ze Huang,Jiahui Zhang,Zeyu Yang,Li Zhang

http://arxiv.org/abs/2312.02934v1

Compressor summary: WoVoGen is a system that generates high-quality and diverse street-view videos for autonomous driving datasets by using a 4D world volume and sensor interconnectivity.


WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words

Lukas Wolf,Klemen Kotar,Greta Tuckute,Eghbal Hosseini,Tamar Regev,Ethan Wilcox,Alex Warstadt

http://arxiv.org/abs/2312.02931v1

Compressor summary: The paper introduces Whisbert, a multimodal language model that combines text and audio, but finds that it does not improve over the text-only version in terms of optimization and performance.


LivePhoto: Real Image Animation with Text-guided Motion Control

Xi Chen,Zhiheng Liu,Mengting Chen,Yutong Feng,Yu Liu,Yujun Shen,Hengshuang Zhao

http://arxiv.org/abs/2312.02928v1

Compressor summary: LivePhoto is a system that uses text to animate images with temporal motions and allows users to adjust the intensity of the motion.


Split & Merge: Unlocking the Potential of Visual Adapters via Sparse Training

Qizhe Zhang,Bocheng Zou,Ruichuan An,Jiaming Liu,Shanghang Zhang

http://arxiv.org/abs/2312.02923v1

Compressor summary: MoSA is a new Adapter Tuning method that splits adapters into modules, stochastically activates them for sparse training, and merges them after tuning to achieve better performance than standard methods without extra overhead.


Fine-grained Controllable Video Generation via Object Appearance and Context

Hsin-Ping Huang,Yu-Chuan Su,Deqing Sun,Lu Jiang,Xuhui Jia,Yukun Zhu,Ming-Hsuan Yang

http://arxiv.org/abs/2312.02919v1

Compressor summary: FACTOR is a text-to-video generation method that allows detailed control of objects' appearances, context, location, and category by injecting control signals into the existing model using joint encoder and adaptive cross-attention layers.


Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

Yuang Ai,Huaibo Huang,Xiaoqiang Zhou,Jiexiang Wang,Ran He

http://arxiv.org/abs/2312.02918v1

Compressor summary: MPerceiver is a multimodal prompt learning approach that leverages Stable Diffusion priors to improve adaptiveness, generalizability, and fidelity for all-in-one image restoration, outperforming state-of-the-art methods in multiple tasks.


MIND: Multi-Task Incremental Network Distillation

Jacopo Bonato,Francesco Pelosin,Luigi Sabetta,Alessandro Nicolosi

http://arxiv.org/abs/2312.02916v1

Compressor summary: MIND is a method that improves replay-free learning in dynamic data streams, achieving state-of-the-art results on several benchmarks with significant accuracy gains.


Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

Arun Reddy,William Paul,Corban Rivera,Ketul Shah,Celso M. de Melo,Rama Chellappa

http://arxiv.org/abs/2312.02914v1

Compressor summary: UNITE is a method that adapts a video student model to a new domain using an image teacher model with self-supervised pre-training and self-training.


Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

Zahra Abbasiantaeb,Yifei Yuan,Evangelos Kanoulas,Mohammad Aliannejadi

http://arxiv.org/abs/2312.02913v1

Compressor summary: The proposed framework simulates human-like conversations for question-answering systems using GPT-4 as both student and teacher, evaluating their performance and comparing them to human-generated conversations.


HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

Helisa Dhamo,Yinyu Nie,Arthur Moreau,Jifei Song,Richard Shaw,Yiren Zhou,Eduardo Pérez-Pellitero

http://arxiv.org/abs/2312.02902v1

Compressor summary: HeadGaS is a hybrid model that uses 3D Gaussian Splats and learnable latent features for fast and high-quality 3D head reconstruction and animation.


Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review

Cristiano Mesquita Garcia,Ramon Simoes Abilio,Alessandro Lameiras Koerich,Alceu de Souza Britto Jr.,Jean Paul Barddal

http://arxiv.org/abs/2312.02901v1

Compressor summary: The paragraph discusses how researchers are working on discovering patterns in textual data from social media and other sources, and the challenges they face due to concept drift and outdated datasets and models.


BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai,Zirui Song,Dayan Guan,Zhenhao Chen,Xing Luo,Chenyu Yi,Alex Kot

http://arxiv.org/abs/2312.02896v1

Compressor summary: The paper introduces BenchLMM, a benchmark to evaluate how well large multimodal models (LMMs) can handle different image styles, and suggests a method to improve their performance by having them predict the style first.


Towards More Practical Group Activity Detection: A New Benchmark and Model

Dongkeun Kim,Youngkil Song,Minsu Cho,Suha Kwak

http://arxiv.org/abs/2312.02878v1

Compressor summary: The authors introduce a new large-scale dataset (Café) for group activity detection (GAD) in videos, along with a new model that handles unknown groups and members better than previous approaches.


A Dynamic Network for Efficient Point Cloud Registration

Yang Ai,Xi Yang

http://arxiv.org/abs/2312.02877v1

Compressor summary: The paper proposes a dynamic, iterative point cloud registration method that uses deep global sampling and local registration to remove noisy points, improving efficiency by over 40% on two datasets.


Toward autocorrection of chemical process flowsheets using large language models

Lukas Schulze Balhorn,Marc Caballero,Artur M. Schweidtmann

http://arxiv.org/abs/2312.02873v1

Compressor summary: The authors propose using generative AI and large language models to automatically correct errors in process flow diagrams and instrumentation diagrams, which could improve safety, efficiency, and cost savings in the process engineering domain.


Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction

Angie Nataly Melo,Carlota Salinas,Miguel Angel Sotelo

http://arxiv.org/abs/2312.02872v1

Compressor summary: The research proposes an explainable and interpretable pedestrian crossing prediction method using deep learning and fuzzy logic.


Attention-enhanced neural differential equations for physics-informed deep learning of ion transport

Danyal Rehman,John H. Lienhard

http://arxiv.org/abs/2312.02871v1

Compressor summary: The authors propose a machine learning approach to model ion transport in nanoporous membranes, using attention-enhanced neural differential equations that incorporate electroneutrality biases and outperform conventional PDE-based methods.


Semi-Supervised Health Index Monitoring with Feature Generation and Fusion

Gaëtan Frusque,Ismail Nejjar,Majid Nabavi,Olga Fink

http://arxiv.org/abs/2312.02867v1

Compressor summary: The authors propose a semi-supervised method to construct a Health Index (HI) for system health evaluation using run-to-failure datasets and deep learning, addressing interpretability and sensitivity issues, and apply it to monitor wear states of thermal spray coatings.


Lessons from Usable ML Deployments and Application to Wind Turbine Monitoring

Alexandra Zytek,Wei-En Wang,Sofia Koukoura,Kalyan Veeramachaneni

http://arxiv.org/abs/2312.02859v1

Compressor summary: The paragraph discusses three key lessons learned from deploying usable machine learning in real-world domains and how they can be applied to wind turbine monitoring for decision-making in renewable energy.


Towards Causal Representations of Climate Model Data

Julien Boussard,Chandni Nagda,Julia Kaltenborn,Charlotte Emilie Elektra Lange,Philippe Brouillard,Yaniv Gurwicz,Peer Nowack,David Rolnick

http://arxiv.org/abs/2312.02858v1

Compressor summary: The authors explore how causal representation learning, specifically CDSD, can improve the efficiency and interpretability of climate model emulators for simulating future climate change scenarios.


Expert-guided Bayesian Optimisation for Human-in-the-loop Experimental Design of Known Systems

Tom Savage,Ehecatl Antonio del Rio Chanona

http://arxiv.org/abs/2312.02852v1

Compressor summary: The authors propose a method to integrate human expertise into Bayesian optimization by allowing experts to choose between multiple optimal solutions with high utility and low redundancy at each iteration.


Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

Lalit Pandey,Samantha M. W. Wood,Justin N. Wood

http://arxiv.org/abs/2312.02843v1

Compressor summary: The study shows that vision transformers (ViTs) can learn view invariant object recognition tasks like newborn chicks when trained on similar impoverished visual environments, challenging the assumption that ViTs are more data hungry than biological systems.


MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition

Nicolas Menet,Michael Hersche,Geethan Karunaratne,Luca Benini,Abu Sebastian,Abbas Rahimi

http://arxiv.org/abs/2312.02829v1

Compressor summary: The paragraph describes MIMONets, which are neural networks that can process multiple inputs simultaneously using variable binding mechanisms and superposition, achieving speedup and accuracy trade-offs in various architectures like CNN and Transformer.


Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis

Florent Forest,Olga Fink

http://arxiv.org/abs/2312.02826v1

Compressor summary: The paper proposes a new unsupervised domain adaptation method called Calibrated Adaptive Teacher (CAT) to improve intelligent fault diagnosis using deep learning, which calibrates the teacher network's predictions during self-training.


RotaTR: Detection Transformer for Dense and Rotated Object

Zhu Yuke,Ruan Yumeng,Yang Lei,Guo Sheng

http://arxiv.org/abs/2312.02821v1

Compressor summary: The paper proposes RotaTR, an extension of DETR that uses Rotation Sensitive deformable attention to improve detection of dense and rotated objects in scenes.


Clustering Pseudo Language Family in Multilingual Translation Models with Fisher Information Matrix

Xinyu Ma,Xuebo Liu,Min Zhang

http://arxiv.org/abs/2312.02820v1

Compressor summary: The authors propose a new method using the Fisher information matrix to cluster languages into pseudo families, which improves multilingual translation model performance and language similarity measurements.


Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting

Donggeun Yoon,Minseok Seo,Doyi Kim,Yeji Choi,Donghyeon Cho

http://arxiv.org/abs/2312.02819v1

Compressor summary: The paper introduces a new model, DGDM, that combines deterministic and probabilistic methods for accurate and probabilistic weather forecasting and evaluates it on various datasets.


BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Fengyuan Shi,Jiaxi Gu,Hang Xu,Songcen Xu,Wei Zhang,Limin Wang

http://arxiv.org/abs/2312.02813v1

Compressor summary: The authors propose BIVDiff, a training-free framework for general-purpose video synthesis that combines image diffusion models with text-to-video foundation diffusion models to address challenges in downstream video tasks.


Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems

Céline Comte,Matthieu Jonckheere,Jaron Sanders,Albert Senen-Cerda

http://arxiv.org/abs/2312.02804v1

Compressor summary: The paper introduces score-aware gradient estimators (SAGEs) for policy-gradient methods in large state and action space Markov decision processes (MDPs), which can estimate the policy gradient without value-function estimation and have better convergence properties, especially for product-form stationary distributions.


Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic

Vera Pavlova

http://arxiv.org/abs/2312.02803v1

Compressor summary: The authors present a novel approach to Qur'anic information retrieval using neural methods, data augmentation, and domain-specific language models in both English and Arabic, achieving state-of-the-art results.


Weakly Supervised Detection of Hallucinations in LLM Activations

Miriam Rateike,Celia Cintas,John Wamburu,Tanya Akumu,Skyler Speakman

http://arxiv.org/abs/2312.02798v1

Compressor summary: The authors propose an auditing method to detect anomalies in large language models' internal states and identify the nodes responsible for encoding hallucinations.


Large Language Models on Graphs: A Comprehensive Survey

Bowen Jin,Gang Liu,Chi Han,Meng Jiang,Heng Ji,Jiawei Han

http://arxiv.org/abs/2312.02783v1

Compressor summary: The paragraph discusses the applications and techniques of large language models on graph data, and provides a systematic review of scenarios and methods for using them in various contexts.


PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features

Tianshun Han,Shengnan Gui,Yiqing Huang,Baihui Li,Lijian Liu,Benjia Zhou,Ning Jiang,Quan Lu,Ruicong Zhi,Yanyan Liang,Du Zhang,Jun Wan

http://arxiv.org/abs/2312.02781v1

Compressor summary: PMMTalk is a novel framework that uses pseudo multi-modal features to improve 3D facial animation by incorporating visual and textual cues from speech, requiring only an additional reference image for more accurate results.


Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

Xu Shi,Chuanchen Luo,Junran Peng,Hongwen Zhang,Yunlian Sun

http://arxiv.org/abs/2312.02772v1

Compressor summary: The paper introduces FG-MDM, a framework that uses a large language model to parse vague textual annotations into fine-grained descriptions of human motions and generates fine-grained and stylized motions with a transformer-based diffusion model.


Learning "Look-Ahead" Nonlocal Traffic Dynamics in a Ring Road

Chenguang Zhao,Huan Yu

http://arxiv.org/abs/2312.02770v1

Compressor summary: The paper proposes a data-enhanced nonlocal traffic model using a physics-informed neural network to learn the look-ahead dynamics and improve traffic wave prediction.


C-NERF: Representing Scene Changes as Directional Consistency Difference-based NeRF

Rui Huang,Binbin Jiang,Qingyi Zhao,William Wang,Yuxiang Zhang,Qing Guo

http://arxiv.org/abs/2312.02751v1

Compressor summary: The authors propose C-NERF, a method to detect changes in a scene represented by neural radiance fields (NeRFs), which outperforms existing 2D change detection and NeRF-based methods.


Compositional Generalization for Data-to-Text Generation

Xinnuo Xu,Ivan Titov,Mirella Lapata

http://arxiv.org/abs/2312.02748v1

Compressor summary: The paragraph discusses data-to-text generation challenges, proposes a compositional generalization benchmark, and introduces a new model that clusters predicates for improved textual descriptions.


LExCI: A Framework for Reinforcement Learning with Embedded Systems

Kevin Badalian,Lucas Koch,Tobias Brinkmann,Mario Picerno,Marius Wegener,Sung-Yong Lee,Jakob Andert

http://arxiv.org/abs/2312.02739v1

Compressor summary: The paper introduces LExCI, a free and open-source framework that allows training agents on embedded systems using the RLlib library, overcoming challenges faced by professionals in control engineering.


Towards Measuring Representational Similarity of Large Language Models

Max Klabunde,Mehdi Ben Amor,Michael Granitzer,Florian Lemmerich

http://arxiv.org/abs/2312.02730v1

Compressor summary: The authors investigate how similar large language models (LLMs) with 7B parameters are and find that some LLMs differ significantly, while cautioning about potential pitfalls in measuring similarity.


R3D-SWIN:Use Shifted Window Attention for Single-View 3D Reconstruction

Chenhuan Li,Meihua Xiao,zehuan li,Mengxi Gao

http://arxiv.org/abs/2312.02725v1

Compressor summary: The authors propose a new method for voxel 3D reconstruction using shifted windows attention, which improves the accuracy of single-view reconstruction compared to existing methods.


Towards the Inferrence of Structural Similarity of Combinatorial Landscapes

Mingyu Huang,Ke Li

http://arxiv.org/abs/2312.02720v1

Compressor summary: The paper proposes using graph data mining techniques to analyze local optima networks and find structural similarities between fitness landscapes of different combinatorial optimization problems, which can help improve problem-solving by analogy.


(Provable) Adversarial Robustness for Group Equivariant Tasks: Graphs, Point Clouds, Molecules, and More

Jan Schuchardt,Yan Scholten,Stephan Günnemann

http://arxiv.org/abs/2312.02708v1

Compressor summary: The paper proposes a new way to measure adversarial robustness in machine learning models that considers task equivariance and provides methods to achieve provable robustness for various tasks.


Large Knowledge Model: Perspectives and Challenges

Huajun Chen

http://arxiv.org/abs/2312.02706v1

Compressor summary:
Key points:
- Human languages carry world knowledge
- Large Language Models (LLMs) like ChatGPT process and manipulate world knowledge in neural networks
- The article explores how symbolic knowledge (e.g., Knowledge Graphs) can enhance LLMs and how LLMs can amplify traditional knowledge bases
- The authors propose Large Knowledge Models (LKM) to manage diverse knowledge structures and discuss some challenges and principles for LKM

Summary: The article explores the role of symbolic knowledge in enhancing and amplifying Large Language Models (LLMs), which process world knowledge in neural networks, and proposes Large Knowledge Models (LKM) to manage diverse knowledge structures, with some challenges and principles.


Unified learning-based lossy and lossless JPEG recompression

Jianghui Zhang,Yuanyuan Wang,Lina Guo,Jixiang Luo,Tongda Xu,Yan Wang,Zhi Wang,Hongwei Qin

http://arxiv.org/abs/2312.02705v1

Compressor summary: The paper presents a new method for compressing JPEG images that combines lossy and lossless techniques using learned quantization tables and hierarchical variational autoencoders, achieving low distortion near the bitrate limit.


MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

Bo Ding,Zhenfeng Fan,Shuang Yang,Shihong Xia

http://arxiv.org/abs/2312.02703v1

Compressor summary: MyPortrait is a framework for generating realistic talking faces with personalized details using monocular video and a 3D face morphable space, supporting both video-driven and audio-driven face animation and outperforming state-of-the-art methods.


Neural Sign Actors: A diffusion model for 3D sign language production from text

Vasileios Baltatzis,Rolandos Alexandros Potamias,Evangelos Ververas,Guanxiong Sun,Jiankang Deng,Stefanos Zafeiriou

http://arxiv.org/abs/2312.02702v1

Compressor summary: The proposed 3D diffusion-based model generates realistic Sign Language avatars using a novel graph neural network and outperforms previous methods, potentially reducing communication barriers between Deaf and hearing communities.


Revisit Human-Scene Interaction via Space Occupancy

Xinpeng Liu,Haowen Hou,Yanchao Yang,Yong-Lu Li,Cewu Lu

http://arxiv.org/abs/2312.02700v1

Compressor summary: The paper proposes a new approach to generate human-scene interaction (HSI) using motion-only data, which can handle complex scenes and generalize well without ground truth 3D scenes.


Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security

Muhammad Umer Ramzan,Usman Ali,Syed Haider Abbas Naqvi,Zeeshan Aslam,Tehseen,Husnain Ali,Muhammad Faheem

http://arxiv.org/abs/2312.02699v1

Compressor summary: The paragraph describes an auto management system that uses deep learning models for vehicle entrance and parking, integrating various technologies like vehicle detection, license plate recognition, and face recognition to ensure efficiency, security, and convenience.


Analyzing and Improving the Training Dynamics of Diffusion Models

Tero Karras,Miika Aittala,Jaakko Lehtinen,Janne Hellsten,Timo Aila,Samuli Laine

http://arxiv.org/abs/2312.02696v1

Compressor summary: The paper improves the ADM diffusion model for data-driven image synthesis by redesigning network layers to preserve magnitudes and introducing a method for tuning exponential moving average parameters after training.
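
For context on the exponential-moving-average part of that summary, here is a minimal, generic sketch of how an EMA of model weights is typically maintained during training. It illustrates the standard technique only, not the paper's post-hoc EMA tuning; the helper name update_ema, the toy model, and beta=0.999 are illustrative assumptions.

```python
import copy
import torch

def update_ema(ema_model, model, beta=0.999):
    # Blend EMA weights toward current weights: ema <- beta * ema + (1 - beta) * w
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(beta).add_(p, alpha=1.0 - beta)

model = torch.nn.Linear(8, 8)          # stand-in for a diffusion network
ema_model = copy.deepcopy(model)       # EMA copy, updated after every optimizer step
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for _ in range(100):                   # dummy training loop
    loss = model(torch.randn(4, 8)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_ema(ema_model, model)       # ema_model is what you would sample/evaluate with
```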


UPOCR: Towards Unified Pixel-Level OCR Interface

Dezhi Peng,Zhenhua Yang,Jiaxin Zhang,Chongyu Liu,Yongxin Shi,Kai Ding,Fengjun Guo,Lianwen Jin

http://arxiv.org/abs/2312.02694v1

Compressor summary: The paper introduces UPOCR, a simple and effective generalist model that unifies diverse OCR tasks as image-to-image transformation using a vision Transformer encoder-decoder with learnable task prompts, achieving state-of-the-art performance on three tasks.


DeepPointMap: Advancing LiDAR SLAM with Unified Neural Descriptors

Xiaze Zhang,Ziheng Ding,Qi Jing,Yuejie Zhang,Wenchao Ding,Rui Feng

http://arxiv.org/abs/2312.02684v1

Compressor summary: The paper presents DeepPointMap, a unified architecture that uses neural networks to extract sparse neural descriptors from point clouds, achieving high localization accuracy and memory-efficient map representation for SLAM tasks and multi-agent collaboration.


H-GAP: Humanoid Control with a Generalist Planner

Zhengyao Jiang,Yingchen Xu,Nolan Wagener,Yicheng Luo,Michael Janner,Edward Grefenstette,Tim Rocktäschel,Yuandong Tian

http://arxiv.org/abs/2312.02682v1

Compressor summary: The paper introduces H-GAP, a model that generates humanoid trajectories from human motion-captured data and can handle various control tasks with MPC, outperforming baselines and transferring behaviors flexibly.


Amortized Bayesian Decision Making for simulation-based models

Mila Gorecki,Jakob H. Macke,Michael Deistler

http://arxiv.org/abs/2312.02674v1

Compressor summary: The authors propose a neural network method for Bayesian decision making on stochastic simulators without computing explicit posterior approximations, and show its effectiveness in both benchmark problems and a real-world medical neurosciences application.


Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? An Investigation and the HOI-Synth Domain Adaptation Benchmark

Rosario Leonardi,Antonino Furnari,Francesco Ragusa,Giovanni Maria Farinella

http://arxiv.org/abs/2312.02672v1

Compressor summary: This study shows that synthetic data and domain adaptation techniques can improve egocentric hand-object interaction detection performance while reducing the need for real data annotations.


Lights out: training RL agents robust to temporary blindness

N. Ordonez,M. Tromp,P. M. Julbe,W. Böhmer

http://arxiv.org/abs/2312.02665v1

Compressor summary: The paragraph describes a method for training DQN agents to handle real-world interruptions in observations, using a neural network over hidden representations and a new loss function that lets the agent keep acting until it receives a recognized observation again, demonstrating robustness to temporary blindness.


FaceStudio: Put Your Face Everywhere in Seconds

Yuxuan Yan,Chi Zhang,Rui Wang,Pei Cheng,Gang Yu,Bin Fu

http://arxiv.org/abs/2312.02663v1

Compressor summary: The study presents a new approach for creating personalized images that maintain the subject's identity by combining stylized, facial, and textual guidance, achieving efficient and high-quality results.


A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems

Pere Izquierdo Gomez,Miguel E. Lopez Gajardo,Nenad Mijatovic,Tomislav Dragicevic

http://arxiv.org/abs/2312.02661v1

Compressor summary: The text describes an edge computing method that uses autonomous data selection to improve condition monitoring of power electronic converters using field data and machine learning.


Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciarán

Andrew J. Charlton-Perez,Helen F. Dacre,Simon Driscoll,Suzanne L. Gray,Ben Harvey,Natalie J. Harvey,Kieran M. R. Hunt,Robert W. Lee,Ranjini Swaminathan,Remy Vandaele,Ambrogio Volonté

http://arxiv.org/abs/2312.02658v1

Compressor summary: The paragraph compares the accuracy of four machine learning models in forecasting the structure and details of Storm Ciarán, a European windstorm, with numerical weather prediction models.


TPA3D: Triplane Attention for Fast Text-to-3D Generation

Hong-En Chen,Bin-Shih Wu,Sheng-Yu Huang,Yu-Chiang Frank Wang

http://arxiv.org/abs/2312.02647v1

Compressor summary: The paper introduces TPA3D, a GAN-based model for fast text-to-3D generation using attention mechanisms on text features and 3D shape data.


SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting

Xiaobei Zou,Luolin Xiong,Yang Tang,Jurgen Kurths

http://arxiv.org/abs/2312.02646v1

Compressor summary: The authors propose a new framework for spatio-temporal forecasting that considers time delays and multi-scale interactions by using a series-aligned graph convolution layer and a multi-scale graph learning architecture.


Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

Camillo Quattrocchi,Antonino Furnari,Daniele Di Mauro,Mario Valerio Giuffrida,Giovanni Maria Farinella

http://arxiv.org/abs/2312.02638v1

Compressor summary: The paper proposes a method to adapt a temporal action segmentation system from exocentric to egocentric cameras using existing labeled and unlabeled video pairs without collecting new labels, achieving similar performance to supervised methods.


Diffusion Noise Feature: Accurate and Fast Generated Image Detection

Yichi Zhang,Xiaogang Xu

http://arxiv.org/abs/2312.02625v1

Compressor summary: The paper proposes a new image representation called Diffusion Noise Feature (DNF) that can effectively detect generated images by exploiting the differences between real and fake images in an inverse diffusion process within a pre-trained diffusion model.


On the Initialization of Graph Neural Networks

Jiahang Li,Yakun Song,Xiang Song,David Paul Wipf

http://arxiv.org/abs/2312.02622v1

Compressor summary: Virgo is a new initialization method for GNNs that reduces variance instability by considering the effects of activation function, hidden dimension, graph structure and message passing on forward and backward propagation.


Rethinking and Simplifying Bootstrapped Graph Latents

Wangbin Sun,Jintang Li,Liang Chen,Bingzhe Wu,Yatao Bian,Zibin Zheng

http://arxiv.org/abs/2312.02619v1

Compressor summary: The paper proposes SGCL, a simple and efficient graph self-supervised learning framework that eliminates negative samples and reduces model complexity by using outputs from two consecutive iterations as positive pairs.


Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media

Evlampios Apostolidis,Konstantinos Apostolidis,Vasileios Mezaris

http://arxiv.org/abs/2312.02616v1

Compressor summary: The paper introduces a web tool that creates custom video summaries for social media platforms using AI models.


Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models

Sungik Choi,Hankook Lee,Honglak Lee,Moontae Lee

http://arxiv.org/abs/2312.02615v1

Compressor summary: Projection Regret (PR) is an efficient novelty detection method for diffusion models that uses perceptual distance and recursive projections to detect abnormal samples whose background information is similar to in-distribution data.


Prompt Optimization via Adversarial In-Context Learning

Xuan Long Do,Yiran Zhao,Hannah Brown,Yuxi Xie,James Xu Zhao,Nancy F. Chen,Kenji Kawaguchi,Michael Qizhe Xie,Junxian He

http://arxiv.org/abs/2312.02614v1

Compressor summary: The paper introduces adv-ICL, a new method that optimizes prompts for in-context learning using adversarial learning with pre-trained models, which improves performance on various tasks and is computationally efficient.


A Unified Simulation Framework for Visual and Behavioral Fidelity in Crowd Analysis

Niccolò Bisagno,Nicola Garau,Antonio Luigi Stefani,Nicola Conci

http://arxiv.org/abs/2312.02613v1

Compressor summary: The paragraph describes a human crowd simulator called UniCrowd that can generate annotated data for various computer vision tasks involving crowds.


Privacy-Aware Data Acquisition under Data Similarity in Regression Markets

Shashi Raj Pandey,Pierre Pinson,Petar Popovski

http://arxiv.org/abs/2312.02611v1

Compressor summary: Data similarity and privacy preferences are important for designing data markets, and a new protocol using local differential privacy is proposed to address this issue in a two-party data acquisition mechanism.
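
For background on the local differential privacy ingredient mentioned above, below is a minimal sketch of the standard Laplace mechanism for releasing one bounded value under epsilon-LDP. It is a generic illustration rather than the paper's data-acquisition protocol; the function name laplace_ldp, the bounds, and epsilon=1.0 are assumptions made for the example.

```python
import numpy as np

def laplace_ldp(value, lower, upper, epsilon, rng=None):
    # Release a single bounded value under epsilon-local differential privacy:
    # clamp to [lower, upper] and add Laplace noise scaled to the range (the sensitivity).
    rng = rng or np.random.default_rng()
    sensitivity = upper - lower
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(np.clip(value, lower, upper) + noise)

# e.g., a data seller perturbs a feature in [0, 1] before sharing it with the market
private_obs = laplace_ldp(0.42, lower=0.0, upper=1.0, epsilon=1.0)
print(private_obs)
```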


Panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps

Florian Kofler,Hendrik Möller,Josef A. Buchner,Ezequiel de la Rosa,Ivan Ezhov,Marcel Rosier,Isra Mekki,Suprosanna Shit,Moritz Negwer,Rami Al-Maskari,Ali Ertürk,Shankeeth Vinayahalingam,Fabian Isensee,Sarthak Pati,Daniel Rueckert,Jan S. Kirschke,Stefan K. Ehrlich,Annika Reinke,Bjoern Menze,Benedikt Wiestler,Marie Piraud

http://arxiv.org/abs/2312.02608v1

Compressor summary: panoptica is a new Python package that computes various metrics to evaluate 2D and 3D segmentation quality for biomedical applications.


Impact of Tokenization on LLaMa Russian Adaptation

Mikhail Tikhomirov,Daniil Chernyshev

http://arxiv.org/abs/2312.02598v1

Compressor summary: The paper investigates using vocabulary substitution to improve the non-English performance and efficiency of large language models, and shows positive results on the Russian SuperGLUE benchmark and in human evaluation.


TSVR+: Twin support vector regression with privileged information

Anuradha Kumari,M. Tanveer

http://arxiv.org/abs/2312.02596v1

Compressor summary: The paper introduces TSVR+, a fusion of twin support vector regression with learning using privileged information, which uses both regular and privileged features for training and improves the efficiency of prediction.


FRAPPÉ: A Post-Processing Framework for Group Fairness Regularization

Alexandru Ţifrea,Preethi Lahoti,Ben Packer,Yoni Halpern,Ahmad Beirami,Flavien Prost

http://arxiv.org/abs/2312.02592v1

Compressor summary: The paper presents a new post-processing technique for group fairness that overcomes the limitations of existing methods and achieves similar performance to in-processing approaches.


Text Intimacy Analysis using Ensembles of Multilingual Transformers

Tanmay Chavan,Ved Patwardhan

http://arxiv.org/abs/2312.02590v1

Compressor summary: The paper presents a method for estimating intimacy level in text using multilingual models and various data augmentation techniques, with applications to tweets in multiple languages.


Empathy and Distress Detection using Ensembles of Transformer Models

Tanmay Chavan,Kshitij Deshpande,Sheetal Sonawane

http://arxiv.org/abs/2312.02578v1

Compressor summary: The paper describes an approach for detecting empathy and distress in natural language conversations using BERT-based models and ensemble methods, achieving third place in a shared task.


An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos

Ioannis Kontostathis,Evlampios Apostolidis,Vasileios Mezaris

http://arxiv.org/abs/2312.02576v1

Compressor summary: This paper introduces an integrated system for creating concise summaries of 360-degree videos by detecting important events and using different methods depending on camera movement.


UTBoost: A Tree-boosting based System for Uplift Modeling

Junjie Gao,Xiangyu Zheng,DongDong Wang,Zhixiang Huang,Bangqi Zheng,Kai Yang

http://arxiv.org/abs/2312.02573v1

Compressor summary: Uplift modeling uses machine learning techniques to estimate the net effect of an action on some customer outcome; this paper proposes two innovative adaptations of the Gradient Boosting Decision Trees algorithm that improve uplift estimation, and introduces UTBoost, an open-source system for uplift modeling.
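
To make the uplift-modeling setting concrete, here is a minimal sketch of the classic two-model ("T-learner") baseline, which estimates uplift as the difference between a model fit on treated customers and one fit on controls. This is the textbook approach, not UTBoost's tree-boosting adaptations; the helper name, the toy data, and the use of scikit-learn's GradientBoostingRegressor are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_two_model_uplift(X, treated, outcome):
    # Fit one GBDT on the treated group and one on the control group;
    # predicted uplift is the difference of their outcome predictions.
    model_t = GradientBoostingRegressor().fit(X[treated == 1], outcome[treated == 1])
    model_c = GradientBoostingRegressor().fit(X[treated == 0], outcome[treated == 0])
    return lambda X_new: model_t.predict(X_new) - model_c.predict(X_new)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
treated = rng.integers(0, 2, size=500)
outcome = X[:, 0] + 0.5 * treated * X[:, 1] + rng.normal(scale=0.1, size=500)
predict_uplift = fit_two_model_uplift(X, treated, outcome)
print(predict_uplift(X[:5]))
```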


Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent

Jianmeng Liu,Yuyao Zhang,Zeyuan Meng,Yu-Wing Tai,Chi-Keung Tang

http://arxiv.org/abs/2312.02568v1

Compressor summary: The paper presents Prompt2NeRF-PIL, a fast and easy way to generate 3D scenes from text or images using a pre-trained implicit latent space of NeRF parameters, which also speeds up existing prompt-to-NeRF methods.


Structured World Representations in Maze-Solving Transformers

Michael Igorevich Ivanitskiy,Alex F. Spies,Tilman Räuker,Guillaume Corlouer,Chris Mathwin,Lucia Quirke,Can Rager,Rusheb Shah,Dan Valentine,Cecilia Diniz Behn,Katsumi Inoue,Samy Wu Fung

http://arxiv.org/abs/2312.02566v1

Compressor summary: The authors study how small transformer models learn to solve mazes and discover that they form structured representations of the maze topology and paths, as well as identifying specific attention heads for path-following.


DanZero+: Dominating the GuanDan Game through Reinforcement Learning

Youpeng Zhao,Yudong Lu,Jian Zhao,Wengang Zhou,Houqiang Li

http://arxiv.org/abs/2312.02561v1

Compressor summary: The authors develop and evaluate an AI program for the complex card game GuanDan using deep Monte Carlo techniques and policy-based reinforcement learning, achieving superior performance compared to baseline methods.


ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference

Tianchi Cai,Xierui Song,Jiyan Jiang,Fei Teng,Jinjie Gu,Guannan Zhang

http://arxiv.org/abs/2312.02554v1

Compressor summary: The paper proposes a method for aligning language models to user's intent using both supervised fine-tuning and point-wise preference learning, and introduces a new dataset for harmlessness.


DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Thong Nguyen,Xiaobao Wu,Xinshuai Dong,Cong-Duy Nguyen,See-Kiong Ng,Luu Anh Tuan

http://arxiv.org/abs/2312.02549v1

Compressor summary: The paper proposes an energy-based model and a novel Transformer-based architecture to improve localizing video moments corresponding to natural language queries using attention mechanisms.


GeNIe: Generative Hard Negative Images Through Diffusion

Soroush Abbasi Koohpayegani,Anuj Singh,K L Navaneet,Hadi Jamali-Rad,Hamed Pirsiavash

http://arxiv.org/abs/2312.02548v1

Compressor summary: GeNIe is a data augmentation technique that uses diffusion models to generate challenging samples for target categories by merging images from source and target categories, improving deep model training in few-shot and long-tail distribution settings.


Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

Zhuo Huang,Chang Liu,Yinpeng Dong,Hang Su,Shibao Zheng,Tongliang Liu

http://arxiv.org/abs/2312.02546v1

Compressor summary: The paper proposes a method to improve vision models' robustness by using multi-modal language models to provide guidance on correcting noisy predictions in an unsupervised manner.


Graph Information Bottleneck for Remote Sensing Segmentation

Yuntao Shou,Wei Ai,Tao Meng

http://arxiv.org/abs/2312.02545v1

Compressor summary: The paper proposes a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation, which adapts to irregular objects and minimizes task-independent redundant information using information bottleneck theory.