arxiv compressed, 2023-12-08

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2023-12-08, generated by the compressor, my personal LLM-based project.


Scaling Laws of Synthetic Images for Model Training ... for Now

Lijie Fan,Kaifeng Chen,Dilip Krishnan,Dina Katabi,Phillip Isola,Yonglong Tian

http://arxiv.org/abs/2312.04567v1

Compressor summary: This paper investigates the scaling behavior of text-to-image models when their synthetic images are used to train vision systems, identifies the factors that govern this behavior, and finds that synthetic data works well in certain scenarios but struggles to render some concepts, limiting its usefulness for training supervised image classifiers.


Gen2Det: Generate to Detect

Saksham Suri,Fanyi Xiao,Animesh Sinha,Sean Chang Culatana,Raghuraman Krishnamoorthi,Chenchen Zhu,Abhinav Shrivastava

http://arxiv.org/abs/2312.04566v1

Compressor summary: Gen2Det is a simple pipeline that uses state-of-the-art image generation methods to create synthetic training data for object detection, improving performance on various settings and tasks.


MuRF: Multi-Baseline Radiance Fields

Haofei Xu,Anpei Chen,Yuedong Chen,Christos Sakaridis,Yulun Zhang,Marc Pollefeys,Andreas Geiger,Fisher Yu

http://arxiv.org/abs/2312.04565v1

Compressor summary: MuRF is a method for sparse view synthesis that uses discretized volumes and convolutional networks to produce high-quality images across various baseline settings and scenes.


EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

Sharath Girish,Kamal Gupta,Abhinav Shrivastava

http://arxiv.org/abs/2312.04564v1

Compressor summary: The authors propose a technique to reduce memory storage requirements for 3D Gaussian splatting in novel-view scene synthesis, achieving faster training and rendering speeds while maintaining visual quality.


Visual Geometry Grounded Deep Structure From Motion

Jianyuan Wang,Nikita Karaev,Christian Rupprecht,David Novotny

http://arxiv.org/abs/2312.04563v1

Compressor summary: The paper proposes a new deep learning pipeline called VGGSfM that reconstructs the camera poses and 3D structure of a scene from unconstrained images in an end-to-end differentiable manner, improving performance on three datasets.


NeRFiller: Completing Scenes via Generative 3D Inpainting

Ethan Weber,Aleksander Hołyński,Varun Jampani,Saurabh Saxena,Noah Snavely,Abhishek Kar,Angjoo Kanazawa

http://arxiv.org/abs/2312.04560v1

Compressor summary: NeRFiller uses 2D visual generative models to complete missing parts of 3D scenes or objects, achieving the most 3D consistent and plausible scene completions.


GenDeF: Learning Generative Deformation Field for Video Generation

Wen Wang,Kecheng Zheng,Qiuyu Wang,Hao Chen,Zifan Shi,Ceyuan Yang,Yujun Shen,Chunhua Shen

http://arxiv.org/abs/2312.04561v1

Compressor summary: The GenDeF method generates videos by warping a static image with a generative deformation field, which improves visual quality, allows for motion modeling, and enables easy video editing and processing.


PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

Zhaoxi Chen,Fangzhou Hong,Haiyi Mei,Guangcong Wang,Lei Yang,Ziwei Liu

http://arxiv.org/abs/2312.04559v1

Compressor summary: PrimDiffusion is a diffusion-based framework that generates high-quality 3D human models by operating on volumetric primitives, enabling efficient rendering and flexible conditional generation.


MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar

Yufan Chen,Lizhen Wang,Qijing Li,Hongjiang Xiao,Shengping Zhang,Hongxun Yao,Yebin Liu

http://arxiv.org/abs/2312.04558v1

Compressor summary: MonoGaussianAvatar is a novel approach that uses 3D Gaussian points and a deformation field to create realistic head avatars from monocular portrait videos, overcoming the limitations of existing methods.


GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

Shoufa Chen,Mengmeng Xu,Jiawei Ren,Yuren Cong,Sen He,Yanping Xie,Animesh Sinha,Ping Luo,Tao Xiang,Juan-Manuel Perez-Rua

http://arxiv.org/abs/2312.04557v1

Compressor summary: GenTron is a family of generative models using Transformer-based diffusion that improves visual quality and can generate videos from text, achieving high win rates in human evaluations.


Large Language Models for Mathematicians

Simon Frieder,Julius Berner,Philipp Petersen,Thomas Lukasiewicz

http://arxiv.org/abs/2312.04556v1

Compressor summary: The note discusses how large language models can help professional mathematicians: it explains how such models are built, assesses their mathematical abilities, and explores their potential impact on the field.


Improved Visual Grounding through Self-Consistent Explanations

Ruozhen He,Paola Cascante-Bonilla,Ziyan Yang,Alexander C. Berg,Vicente Ordonez

http://arxiv.org/abs/2312.04554v1

Compressor summary: The authors propose SelfEQ, a method that improves object localization in vision-and-language models by generating paraphrases and finetuning for self-consistent visual explanations.


SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing

Tomoki Ichikawa,Shohei Nobuhara,Ko Nishino

http://arxiv.org/abs/2312.04553v1

Compressor summary: SPIDeRS uses polarized patterns of light to capture depth, surface normals, and reflectance of objects invisibly for applications in vision, xR, robotics, and HCI.


Generating Illustrated Instructions

Sachit Menon,Ishan Misra,Rohit Girdhar

http://arxiv.org/abs/2312.04552v1

Compressor summary: The authors introduce a task called Illustrated Instructions, which generates custom visual instructions based on text input, and propose a new model called StackedDiffusion that outperforms existing methods and enables personalized applications.


Free3D: Consistent Novel View Synthesis without 3D Representation

Chuanxia Zheng,Andrea Vedaldi

http://arxiv.org/abs/2312.04551v1

Compressor summary: Free3D is a novel view synthesis method that uses a 2D image generator fine-tuned with ray conditioning normalization and multi-view attention to achieve better pose encoding and consistency without needing a 3D representation.


Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?

Aritra Dutta,Srijan Das,Jacob Nielsen,Rajatsubhra Chakraborty,Mubarak Shah

http://arxiv.org/abs/2312.04548v1

Compressor summary: MAVREC is a large, diverse video dataset with ground and aerial views for improving object detection in aerial images.


Digital Life Project: Autonomous 3D Characters with Social Intelligence

Zhongang Cai,Jianping Jiang,Zhongfei Qing,Xinying Guo,Mingyuan Zhang,Zhengyu Lin,Haiyi Mei,Chen Wei,Ruisi Wang,Wanqi Yin,Xiangyu Fan,Han Du,Liang Pan,Peng Gao,Zhitao Yang,Yang Gao,Jiaqi Li,Tianxiang Ren,Yukun Wei,Xiaogang Wang,Chen Change Loy,Lei Yang,Ziwei Liu

http://arxiv.org/abs/2312.04547v1

Compressor summary: The Digital Life Project is a framework that creates autonomous 3D characters with realistic social interactions and body movements using language and motion synthesis techniques.


Adversarial Learning for Feature Shift Detection and Correction

Miriam Barrabes,Daniel Mas Montserrat,Margarita Geleta,Xavier Giro-i-Nieto,Alexander G. Ioannidis

http://arxiv.org/abs/2312.04546v1

Compressor summary: The text describes a method for detecting and fixing data shifts using adversarial learning with supervised classifiers and iterative heuristics, which outperforms existing techniques.
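
The underlying trick is generic enough to sketch: train a classifier to distinguish reference data from incoming data, and treat per-feature discriminability as evidence of shift. The toy below uses a single-feature threshold stump in place of the paper's adversarially trained models; all data and names are illustrative.

```python
import random

def stump_accuracy(ref, cur):
    # Best threshold-classifier accuracy at separating ref from cur:
    # ~0.5 means indistinguishable, higher means the feature has shifted.
    labeled = [(v, 0) for v in ref] + [(v, 1) for v in cur]
    best = 0.5
    for t, _ in labeled:
        correct = sum((v > t) == bool(y) for v, y in labeled)
        best = max(best, max(correct, len(labeled) - correct) / len(labeled))
    return best

random.seed(0)
reference = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
incoming = [[random.gauss(0, 1) + 3, random.gauss(0, 1)]  # feature 0 shifted
            for _ in range(200)]

scores = [stump_accuracy([row[i] for row in reference],
                         [row[i] for row in incoming]) for i in range(2)]
# scores[0] ends up near 1.0 (shifted feature); scores[1] stays near 0.5.
```

A feature flagged this way could then be corrected, e.g. by iteratively adjusting it until the classifier can no longer separate the two sets, which is the spirit of the paper's iterative heuristics.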


HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image

Tong Wu,Zhibing Li,Shuai Yang,Pan Zhang,Xingang Pan,Jiaqi Wang,Dahua Lin,Ziwei Liu

http://arxiv.org/abs/2312.04543v1

Compressor summary: HyperDreamer is a new method for creating realistic and editable 3D models from a single image using advanced techniques for viewing, rendering, and editing.


Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations

Yuejiang Liu,Ahmad Rahimi,Po-Chien Luan,Frano Rajič,Alexandre Alahi

http://arxiv.org/abs/2312.04540v1

Compressor summary: The authors study how to represent causal relationships in multi-agent systems, propose a metric learning approach for causal awareness, and demonstrate its effectiveness on pedestrian datasets.


Self-Guided Open-Vocabulary Semantic Segmentation

Osman Ülger,Maksymilian Kulicki,Yuki Asano,Martin R. Oswald

http://arxiv.org/abs/2312.04539v1

Compressor summary: The paper presents a novel framework called Self-Seg that uses VLMs to perform open-vocabulary image segmentation without textual input, achieving state-of-the-art results on several datasets.


Trajeglish: Learning the Language of Driving Scenarios

Jonah Philion,Xue Bin Peng,Sanja Fidler

http://arxiv.org/abs/2312.04535v1

Compressor summary: The paper presents Trajeglish, a discrete sequence-modeling approach to simulating dynamic driving scenarios: trajectories are discretized to centimeter-level resolution and multi-agent interactions are modeled with an encoder-decoder, achieving state-of-the-art realism on benchmarks; the method also adapts to improve performance on other datasets and is evaluated for scalability and saliency.
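
The centimeter-level discretization can be illustrated with a toy tokenizer that quantizes successive motion deltas onto a grid; this is only a sketch of the idea, not Trajeglish's actual vocabulary or model.

```python
def tokenize_trajectory(points, bin_cm=1.0):
    # Quantize successive (x, y) deltas (in meters) to centimeter bins,
    # turning a continuous trajectory into a discrete token sequence.
    tokens = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx = round((x1 - x0) * 100 / bin_cm)
        dy = round((y1 - y0) * 100 / bin_cm)
        tokens.append((dx, dy))
    return tokens

traj = [(0.0, 0.0), (0.01, 0.0), (0.03, 0.02)]  # meters
tokens = tokenize_trajectory(traj)  # [(1, 0), (2, 2)]
```

An encoder-decoder can then model such tokens autoregressively, much as a language model treats words.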


PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns

Shuliang Ning,Duomin Wang,Yipeng Qin,Zirong Jin,Baoyuan Wang,Xiaoguang Han

http://arxiv.org/abs/2312.04534v1

Compressor summary: The paper introduces ucVTON, a novel method for realistic synthesis of personalized clothing on human images, allowing flexible specification of style and texture conditions, and enabling superior quality and user experience in virtual try-on applications.


Camera Height Doesn't Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation

Genki Kinoshita,Ko Nishino

http://arxiv.org/abs/2312.04530v1

Compressor summary: StableCamH is a scale-aware unsupervised monocular depth estimation method for road scenes that resolves scale ambiguity from object heights and the invariant camera height, requires no auxiliary sensors or extra supervision, incorporates a learning-based size prior for car appearance, and achieves state-of-the-art accuracy and generalizability.


Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance

Yuto Enyo,Ko Nishino

http://arxiv.org/abs/2312.04529v1

Compressor summary: The paper introduces DRMNet, a stochastic inverse rendering method that recovers the full frequency spectrum of illumination and object reflectance from a single image using a diffusion model.


Using Large Language Models for Hyperparameter Optimization

Michael R. Zhang,Nishkrit Desai,Juhan Bae,Jonathan Lorraine,Jimmy Ba

http://arxiv.org/abs/2312.04528v1

Compressor summary: The paper shows how large language models can help improve hyperparameter optimization efficiency by generating code and making better decisions with limited search budgets.
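
The setup is easy to picture as a loop in which the model sees the search history and proposes the next configuration. In the sketch below, propose_lr is a stand-in for the LLM call and toy_objective for an actual training run; both are illustrative assumptions, not the paper's code.

```python
import random

def toy_objective(lr):
    # Stand-in for a real training run; validation loss minimized at lr = 0.1.
    return (lr - 0.1) ** 2

def propose_lr(history):
    # Stand-in for the LLM: a real system would serialize `history` into a
    # prompt and parse the model's suggested next configuration.
    if not history:
        return 0.5
    best_lr, _ = min(history, key=lambda h: h[1])
    return max(1e-4, best_lr + random.uniform(-0.05, 0.05))

def optimize(budget=20, seed=0):
    random.seed(seed)
    history = []
    for _ in range(budget):
        lr = propose_lr(history)
        history.append((lr, toy_objective(lr)))
    return min(history, key=lambda h: h[1])

best_lr, best_loss = optimize()
```

The paper's point is that, given the same limited budget, an LLM conditioned on the history tends to make better-informed proposals than the random perturbation used here.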


Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

Kohei Yamashita,Vincent Lepetit,Ko Nishino

http://arxiv.org/abs/2312.04527v1

Compressor summary: The paper introduces reflection correspondences, a new type of correspondence that helps estimate camera pose without relying on the background, and proposes methods for using all three kinds of correspondences for robust object shape estimation.


RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Ozgur Kara,Bariscan Kurtkaya,Hidir Yesiltepe,James M. Rehg,Pinar Yanardag

http://arxiv.org/abs/2312.04524v1

Compressor summary: RAVE is a zero-shot video editing method that uses text-to-image diffusion models to create high-quality, temporally consistent, and semantically preserved videos with various edits and efficient memory requirements.


Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping

Alex Costanzino,Pierluigi Zama Ramirez,Giuseppe Lisanti,Luigi Di Stefano

http://arxiv.org/abs/2312.04521v1

Compressor summary: The paper presents a fast framework for anomaly detection using point clouds and RGB images by learning feature mapping between modalities and detecting inconsistencies, achieving state-of-the-art results and improving efficiency with layer pruning.


Bootstrapping Autonomous Radars with Self-Supervised Learning

Yiduo Hao,Sohrab Madani,Junfeng Guan,Mohammed Alloulah,Saurabh Gupta,Haitham Hassanieh

http://arxiv.org/abs/2312.04519v1

Compressor summary: The paper proposes a self-supervised learning method to train radar models for autonomous vehicles using unlabeled data, improving object detection accuracy.


Efficient Monotonic Multihead Attention

Xutai Ma,Anna Sun,Siqi Ouyang,Hirofumi Inaguma,Paden Tomasello

http://arxiv.org/abs/2312.04515v1

Compressor summary: EMMA is a new translation model that improves monotonic alignment estimation, training, and inference, achieving top performance in speech-to-text translation for Spanish and English.


An LLM Compiler for Parallel Function Calling

Sehoon Kim,Suhong Moon,Ryan Tabrizi,Nicholas Lee,Michael W. Mahoney,Kurt Keutzer,Amir Gholami

http://arxiv.org/abs/2312.04511v1

Compressor summary: LLMCompiler is a tool that improves the efficiency and accuracy of multi-function calling in large language models by executing functions in parallel using classical compiler principles.
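
The compiler analogy boils down to scheduling function calls as a dependency graph so that independent calls run concurrently. A minimal sketch of that scheduling idea (not LLMCompiler's implementation; the task names and functions are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks):
    # tasks: name -> (fn, [dependency names]). Runs each "wave" of tasks whose
    # dependencies are already resolved in parallel, like a compiler
    # scheduling independent instructions.
    results = {}
    pending = dict(tasks)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [n for n, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            futures = {n: pool.submit(pending[n][0],
                                      *(results[d] for d in pending[n][1]))
                       for n in ready}
            for n, fut in futures.items():
                results[n] = fut.result()
                del pending[n]
    return results

# Two independent "tool calls" whose results are fused by a third.
tasks = {
    "search_a": (lambda: 2, []),
    "search_b": (lambda: 3, []),
    "combine": (lambda a, b: a + b, ["search_a", "search_b"]),
}
out = run_dag(tasks)  # search_a and search_b run in parallel
```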


A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation

Jarad Forristal,Niloofar Mireshghallah,Greg Durrett,Taylor Berg-Kirkpatrick

http://arxiv.org/abs/2312.04510v1

Compressor summary: The paper proposes a block Metropolis-Hastings sampler for energy-based language models that can generate longer texts by iteratively prompting a large language model, improving both efficiency and accuracy in controlled text generation tasks.
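
Stripped of the language model, the accept/reject core is ordinary Metropolis-Hastings over an energy function. The sketch below uses a toy energy and a block proposal that resamples a span uniformly; in the paper the proposal would instead come from prompting an LLM to rewrite the span.

```python
import math, random

def energy(seq):
    # Toy energy: penalize occurrences of the token "bad" (stands in for an
    # energy-based language model score).
    return float(seq.count("bad"))

def propose(seq, vocab, block=3):
    # Block proposal: resample a contiguous span uniformly from the vocabulary
    # (a real sampler would instead have an LLM rewrite the span).
    i = random.randrange(len(seq))
    new = list(seq)
    for j in range(i, min(i + block, len(seq))):
        new[j] = random.choice(vocab)
    return new

def metropolis_hastings(seq, vocab, steps=500):
    e = energy(seq)
    for _ in range(steps):
        cand = propose(seq, vocab)
        e_cand = energy(cand)
        # Symmetric proposal -> standard Metropolis acceptance rule.
        if random.random() < math.exp(min(0.0, e - e_cand)):
            seq, e = cand, e_cand
    return seq, e

random.seed(0)
start = ["bad"] * 10
final, final_e = metropolis_hastings(start, vocab=["good", "ok", "bad"])
```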


Graph Metanetworks for Processing Diverse Neural Architectures

Derek Lim,Haggai Maron,Marc T. Law,Jonathan Lorraine,James Lucas

http://arxiv.org/abs/2312.04501v1

Compressor summary: The paper introduces Graph Metanetworks (GMNs), a generalizable method for processing graphs representing input neural networks, which can handle various neural architectures and are expressive and equivariant to parameter permutation symmetries.


FRNet: Frustum-Range Networks for Scalable LiDAR Segmentation

Xiang Xu,Lingdong Kong,Hui Shuai,Qingshan Liu

http://arxiv.org/abs/2312.04484v1

Compressor summary: FRNet restores contextual information in range-view LiDAR segmentation using frustum-based feature extraction and fusion, achieving competitive performance with high efficiency.


Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing,Shiwei Zhang,Jiayu Wang,Xiang Wang,Yujie Wei,Yingya Zhang,Changxin Gao,Nong Sang

http://arxiv.org/abs/2312.04483v1

Compressor summary: HiGen is a diffusion model-based method that improves text-to-video generation by decoupling spatial and temporal factors, leading to more realistic and diverse videos with semantics accuracy and motion stability.


GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction

Zhongchang Luo,Marion Robin,Pavan Vasishta

http://arxiv.org/abs/2312.04479v1

Compressor summary: GSGFormer is a new generative model that predicts pedestrian trajectories by considering complex interactions between pedestrians and their environment, offering diverse behavioral modalities and performing well even with limited data.


Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Chengshu Li,Jacky Liang,Andy Zeng,Xinyun Chen,Karol Hausman,Dorsa Sadigh,Sergey Levine,Li Fei-Fei,Fei Xia,Brian Ichter

http://arxiv.org/abs/2312.04474v1

Compressor summary: Chain of Code is a method to improve language models' ability to reason by having them write and emulate code for various linguistic tasks, leading to better performance on reasoning benchmarks.
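
The interpreter-plus-emulator split can be sketched in a few lines: try to execute each line of model-written code, and hand anything unexecutable to a callback standing in for the language model. The `is_fruit` call and the `simulate` callback below are hypothetical, purely for illustration.

```python
def run_chain_of_code(lines, simulate):
    # Execute each line of model-written pseudocode with the Python
    # interpreter when possible, falling back to a "language-model emulator"
    # (here the `simulate` callback) for lines Python cannot run.
    state = {}
    for line in lines:
        try:
            exec(line, {}, state)                # interpreter runs real code
        except Exception:
            state.update(simulate(line, state))  # LM emulates the rest
    return state

# Hypothetical emulator: answers one semantic "call" Python cannot execute.
def simulate(line, state):
    if "is_fruit" in line:
        var = line.split("=")[0].strip()
        return {var: state.get("word") in {"apple", "banana", "pear"}}
    return {}

program = [
    "word = 'apple'",
    "flag = is_fruit(word)",   # not real Python -> emulated
    "answer = 'yes' if flag else 'no'",
]
state = run_chain_of_code(program, simulate)
```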


On the Learnability of Watermarks for Language Models

Chenchen Gu,Xiang Lisa Li,Percy Liang,Tatsunori Hashimoto

http://arxiv.org/abs/2312.04469v1

Compressor summary: The paper proposes watermark distillation, a method for teaching models to generate watermarked text with high detectability, and explores its limitations.


Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Kiran Chhatre,Radek Daněček,Nikos Athanasiou,Giorgio Becherini,Christopher Peters,Michael J. Black,Timo Bolkart

http://arxiv.org/abs/2312.04466v1

Compressor summary: AMUSE is a model that generates realistic 3D human gestures from speech, controlling for content, emotion, and style.


FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models

Stathis Galanakis,Alexandros Lattas,Stylianos Moschoglou,Stefanos Zafeiriou

http://arxiv.org/abs/2312.04465v1

Compressor summary: FitDiff is a diffusion-based 3D face model that uses a 2D image to generate realistic and relightable avatars with high performance.


Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation

Jiayi Huang,Han Zhong,Liwei Wang,Lin F. Yang

http://arxiv.org/abs/2312.04464v1

Compressor summary: UCRL-WVTR is an algorithm for reinforcement learning that eliminates the planning horizon, achieves sharp regret bounds, and is computationally efficient.


PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Zhen Li,Mingdeng Cao,Xintao Wang,Zhongang Qi,Ming-Ming Cheng,Ying Shan

http://arxiv.org/abs/2312.04461v1

Compressor summary: PhotoMaker is a fast text-to-image generation method that preserves identity information by encoding multiple input images into a unified ID representation, enabling various applications.


Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use

Yuhan Chen,Ang Lv,Ting-En Lin,Changyu Chen,Yuchuan Wu,Fei Huang,Yongbin Li,Rui Yan

http://arxiv.org/abs/2312.04455v1

Compressor summary: The paper introduces Attention Buckets, a method that improves large language models' tool use performance by shaping their attention waveform with multiple processes and angles.


OpenAsp: A Benchmark for Multi-document Open Aspect-based Summarization

Shmuel Amar,Liat Schiff,Ori Ernst,Asi Shefer,Ori Shapira,Ido Dagan

http://arxiv.org/abs/2312.04440v1

Compressor summary: The paper introduces OpenAsp, a benchmark dataset for multi-document aspect-based summarization, created from existing datasets using a novel annotation protocol.


DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

Yujie Wei,Shiwei Zhang,Zhiwu Qing,Hangjie Yuan,Zhiheng Liu,Yu Liu,Yingya Zhang,Jingren Zhou,Hongming Shan

http://arxiv.org/abs/2312.04433v1

Compressor summary: DreamVideo is a method to generate personalized videos from static images and motion videos by learning subject appearance and target motion patterns using textual inversion, fine-tuning, and adapters.


Approximate Caching for Efficiently Serving Diffusion Models

Shubham Agarwal,Subrata Mitra,Sarthak Chakraborty,Srikrishna Karanam,Koyel Mukherjee,Shiv Saini

http://arxiv.org/abs/2312.04429v1

Compressor summary: The paper introduces approximate-caching, a technique that reduces resource consumption and latency in text-to-image generation using diffusion models by reusing intermediate noise states for similar prompts.
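
The mechanism is roughly: embed the prompt, look for a sufficiently similar cached prompt, and if one exists resume denoising from its stored intermediate state instead of from pure noise. The sketch below uses a bag-of-words embedding and abstract step counts; the threshold, step numbers, and state format are all illustrative.

```python
import math

def embed(prompt):
    # Toy bag-of-words embedding; a real system would use a text encoder.
    vec = {}
    for w in prompt.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a.get(k, 0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ApproximateCache:
    # Reuse an intermediate noise state from the most similar cached prompt,
    # so only the remaining denoising steps must be run.
    def __init__(self, threshold=0.7, total_steps=50, cached_at=30):
        self.entries = []  # (embedding, intermediate_state)
        self.threshold = threshold
        self.total_steps = total_steps
        self.cached_at = cached_at

    def steps_needed(self, prompt):
        e = embed(prompt)
        best = max((cosine(e, emb) for emb, _ in self.entries), default=0.0)
        if best >= self.threshold:
            return self.total_steps - self.cached_at  # resume from cache
        self.entries.append((e, f"state@{self.cached_at}"))
        return self.total_steps

cache = ApproximateCache()
first = cache.steps_needed("a red sports car on a beach")    # cold: full run
second = cache.steps_needed("a red sports car on the beach") # near-duplicate
```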


Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

Yabo Chen,Jiemin Fang,Yuyang Huang,Taoran Yi,Xiaopeng Zhang,Lingxi Xie,Xinggang Wang,Wenrui Dai,Hongkai Xiong,Qi Tian

http://arxiv.org/abs/2312.04424v1

Compressor summary: The authors propose a cascade generation framework called Cascade-Zero123 that uses two Zero-1-to-3 models to generate multi-view 3D images from one single image, addressing the challenges of geometric and visual consistency across views for complex objects.


Scalable Knowledge Graph Construction and Inference on Human Genome Variants

Shivika Prasanna,Deepthi Rao,Eduardo Simoes,Praveen Rao

http://arxiv.org/abs/2312.04423v1

Compressor summary: The text describes how variant-level information from RNA-sequences of COVID-19 patients was represented as a large, scalable knowledge graph, which was used for analysis and inference tasks.


Monitoring Sustainable Global Development Along Shared Socioeconomic Pathways

Michelle W. L. Wan,Jeffrey N. Clark,Edward A. Small,Elena Fillola Mayoral,Raúl Santos-Rodríguez

http://arxiv.org/abs/2312.04416v1

Compressor summary: The authors propose methods to measure and track sustainable development using data integration and machine learning.


Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

Jiayi Guo,Xingqian Xu,Yifan Pu,Zanlin Ni,Chaofei Wang,Manushree Vasu,Shiji Song,Gao Huang,Humphrey Shi

http://arxiv.org/abs/2312.04410v1

Compressor summary: The paper proposes Smooth Diffusion, a new category of diffusion models that improve latent space smoothness for better text-to-image generation and other downstream tasks.


On the Impact of Multi-dimensional Local Differential Privacy on Fairness

Karima Makhlouf,Heber H. Arcolezi,Sami Zhioua,Ghassen Ben Brahim,Catuscia Palamidessi

http://arxiv.org/abs/2312.04404v1

Compressor summary: This paper studies how local differential privacy (LDP) affects fairness when multiple sensitive attributes are used, and provides recommendations for balancing privacy, fairness, and utility in machine learning applications.
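
For intuition, the canonical single-attribute LDP mechanism is k-ary randomized response; applying it independently to several sensitive attributes splits the privacy budget across dimensions, which is where the fairness and utility trade-offs studied here arise. A minimal sketch (the domain and epsilon values are illustrative):

```python
import math, random

def randomized_response(value, domain, epsilon):
    # k-ary randomized response: report the true value with probability
    # e^eps / (e^eps + k - 1), otherwise a uniformly random *other* value.
    # Satisfies epsilon-local differential privacy for a single attribute.
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return value
    return random.choice([v for v in domain if v != value])

random.seed(0)
# Strong privacy (small epsilon) garbles many more reports than weak privacy.
strict = sum(randomized_response("a", ["a", "b", "c"], 0.1) == "a"
             for _ in range(1000))
loose = sum(randomized_response("a", ["a", "b", "c"], 5.0) == "a"
            for _ in range(1000))
```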


OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

Dongchen Han,Xiaojun Jia,Yang Bai,Jindong Gu,Yang Liu,Xiaochun Cao

http://arxiv.org/abs/2312.04403v1

Compressor summary: The paper proposes a new method, OT-Attack, to generate high-transferability adversarial examples for VLP models by optimizing the alignment between data-augmented image and text pairs using optimal transport theory.


Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning

Yongqi Dong,Xingmin Lu,Ruohan Li,Wei Song,Bart van Arem,Haneen Farah

http://arxiv.org/abs/2312.04398v1

Compressor summary: The paper proposes a four-phase pipeline to detect anomalies in lane rendering maps using self-supervised pre-training with MiM, customized fine-tuning, and post-processing, improving accuracy and efficiency.


PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction

Yinhuai Wang,Jing Lin,Ailing Zeng,Zhengyi Luo,Jian Zhang,Lei Zhang

http://arxiv.org/abs/2312.04393v1

Compressor summary: The text describes PhysHOI, a physics-based approach that teaches simulated humanoids to imitate dynamic human-object interactions using contact graph rewards instead of task-specific rewards, and introduces a dataset of basketball skills for testing the approach.


Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization

Carlos E. Luis,Alessandro G. Bottero,Julia Vinogradska,Felix Berkenkamp,Jan Peters

http://arxiv.org/abs/2312.04386v1

Compressor summary: The paper proposes a new uncertainty Bellman equation for model-based reinforcement learning that improves exploration and policy optimization, and introduces QU-SAC, an algorithm that can handle risk-seeking or risk-averse objectives.


How much informative is your XAI? A decision-making assessment task to objectively measure the goodness of explanations

Marco Matarese,Francesco Rea,Alessandra Sciutti

http://arxiv.org/abs/2312.04379v1

Compressor summary: The paper proposes an assessment task to measure and compare the information power of XAI systems in user-centred approaches, which could improve interaction between users and systems.


LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

Yunsheng Ma,Can Cui,Xu Cao,Wenqian Ye,Peiran Liu,Juanwu Lu,Amr Abdelraouf,Rohit Gupta,Kyungtae Han,Aniket Bera,James M. Rehg,Ziran Wang

http://arxiv.org/abs/2312.04372v1

Compressor summary: LaMPilot is a framework for autonomous driving that uses code generation with behavioral primitives to handle user instructions, and evaluates LLMs on a custom benchmark with GPT-4 achieving high performance.


SingingHead: A Large-scale 4D Dataset for Singing Head Animation

Sijing Wu,Yunhao Li,Weitian Zhang,Jun Jia,Yucheng Zhu,Yichao Yan,Guangtao Zhai

http://arxiv.org/abs/2312.04369v1

Compressor summary: The paper introduces SingingHead, a large dataset for singing head animation, and UniSinger, a framework that uses it to achieve both 3D and 2D facial animation for singing.


DemoCaricature: Democratising Caricature Generation with a Rough Sketch

Dar-Yen Chen,Subhadeep Koley,Aneeshan Sain,Pinaki Nath Chowdhury,Tao Xiang,Ayan Kumar Bhunia,Yi-Zhe Song

http://arxiv.org/abs/2312.04364v1

Compressor summary: The paper introduces methods for generating personalized caricatures from photos and sketches, balancing abstraction and identity while preserving creativity.


PCoQA: Persian Conversational Question Answering Dataset

Hamed Hematian Hemati,Atousa Toghyani,Atena Souri,Sayed Hesam Alavian,Hossein Sameti,Hamid Beigy

http://arxiv.org/abs/2312.04362v1

Compressor summary: The paragraph introduces PCoQA, a Persian conversational question answering dataset with challenges like open-ended non-factual answers and longer answers.


CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models

Zhijing Jin,Yuen Chen,Felix Leeb,Luigi Gresele,Ojasv Kamal,Zhiheng Lyu,Kevin Blin,Fernando Gonzalez Adauto,Max Kleiman-Weiner,Mrinmaya Sachan,Bernhard Schölkopf

http://arxiv.org/abs/2312.04350v1

Compressor summary: The authors propose a new natural language processing task to evaluate whether large language models can perform causal inference using formal rules, and present a challenging dataset and prompting strategy for this purpose.


Improved Efficient Two-Stage Denoising Diffusion Power System Measurement Recovery Against False Data Injection Attacks and Data Losses

Jianhua Pei,Jingyu Wang,Dongyuan Shi,Ping Wang

http://arxiv.org/abs/2312.04346v1

Compressor summary: The paper proposes a two-stage denoising diffusion model for accurate power system measurement recovery despite various uncertainties and complex dynamics, with improved efficiency and robustness.


Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies

Pengcheng Chen,Ziyan Huang,Zhongying Deng,Tianbin Li,Yanzhou Su,Haoyu Wang,Jin Ye,Yu Qiao,Junjun He

http://arxiv.org/abs/2312.04344v1

Compressor summary: The paper examines how to improve GPT-4V's medical imaging interpretation skills using prompt engineering techniques, leading to more reliable and valuable insights for healthcare.


Causality and Explainability for Trustworthy Integrated Pest Management

Ilias Tsoumas,Vasileios Sitokonstantinou,Georgios Giannarakis,Evagelia Lampiri,Christos Athanassiou,Gustau Camps-Valls,Charalampos Kontoes,Ioannis Athanasiadis

http://arxiv.org/abs/2312.04343v1

Compressor summary: The authors propose an advanced data analysis framework to help farmers adopt Integrated Pest Management (IPM) practices by providing accurate pest predictions, interpretable advice, and effective assessments.


Merging by Matching Models in Task Subspaces

Derek Tam,Mohit Bansal,Colin Raffel

http://arxiv.org/abs/2312.04339v1

Compressor summary: The authors propose a new method called MaTS for merging models by matching them based on their task subspace, which improves performance and allows solving intractable problems with various initializations and estimates.


Multi-View Unsupervised Image Generation with Cross Attention Guidance

Llukman Cerkezi,Aram Davtyan,Sepehr Sameni,Paolo Favaro

http://arxiv.org/abs/2312.04337v1

Compressor summary: The paper presents a new method for unsupervised training of a diffusion model that synthesizes novel views from single-category datasets: object poses are identified by clustering, cross-view consistency is enforced with hard-attention guidance, and the model achieves state-of-the-art results on real and synthetic images.


Towards a Perceptual Evaluation Framework for Lighting Estimation

Justine Giroux,Mohammad Reza Karimi Dastjerdi,Yannick Hold-Geoffroy,Javier Vazquez-Corral,Jean-François Lalonde

http://arxiv.org/abs/2312.04334v1

Compressor summary: The authors propose a psychophysical experiment to measure human preference for relit virtual scenes and show that existing image quality assessment metrics do not capture human perception, but a combination of them can improve the evaluation of lighting estimation algorithms.


Beyond Surface: Probing LLaMA Across Scales and Layers

Nuo Chen,Ning Wu,Shining Liang,Ming Gong,Linjun Shou,Dongmei Zhang,Jia Li

http://arxiv.org/abs/2312.04333v1

Compressor summary: The paper probes LLaMA with multiple-choice tasks to measure its reasoning and computational abilities, finding that larger model sizes improve reasoning but add little knowledge, and that lower layers lack arithmetic ability and factual knowledge while upper layers carry most of the computational power and real-world knowledge.


Surrogate Modelling for Sea Ice Concentration using Lightweight Neural Ensemble

Julia Borisova,Nikolay O. Nikitin

http://arxiv.org/abs/2312.04330v1

Compressor summary: LANE-SI is an adaptive deep learning model that forecasts sea ice concentration in the Arctic, achieving comparable or better results than existing physical models.


A Multi-scale Information Integration Framework for Infrared and Visible Image Fusion

Guang Yang,Jie Li,Hanxiao Lei,Xinbo Gao

http://arxiv.org/abs/2312.04328v1

Compressor summary: The authors propose a multi-scale dual attention framework for fusing infrared and visible images, which measures and integrates complementary information at different scales using structure and loss function, and achieves robust and informative results across scenarios.


Learning to sample in Cartesian MRI

Thomas Sanchez

http://arxiv.org/abs/2312.04327v1

Compressor summary: The thesis proposes two algorithms for accelerating MRI acquisition and improving image quality, focusing on Cartesian MRI techniques and comparing them with deep learning methods.


iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

Ruyi Gan,Xiaojun Wu,Junyu Lu,Yuanhe Tian,Dixiang Zhang,Ziwei Wu,Renliang Sun,Chang Liu,Jiaxing Zhang,Pingjian Zhang,Yan Song

http://arxiv.org/abs/2312.04326v1

Compressor summary: The paper presents a text-to-image model for interior design that uses curriculum learning and reinforcement learning to improve prompt-following capabilities and generate high-quality images based on textual descriptions.


MIMo: A Multi-Modal Infant Model for Studying Cognitive Development

Dominik Mattern,Pierre Schumacher,Francisco M. López,Marcel C. Raabe,Markus R. Ernst,Arthur Aubret,Jochen Triesch

http://arxiv.org/abs/2312.04318v1

Compressor summary: The paragraph discusses an open-source multi-modal infant model called MIMo, which simulates early human cognitive development through embodied interactions with the physical and social environment.


GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Zuyao Chen,Jinlin Wu,Zhen Lei,Zhaoxiang Zhang,Changwen Chen

http://arxiv.org/abs/2312.04314v1

Compressor summary: The authors propose a new method, GPT4SGG, to generate scene graphs from detailed narratives based on images, which improves upon traditional language parsing and localization methods for scene graph generation.


Finding Interpretable Class-Specific Patterns through Efficient Neural Search

Nils Philipp Walter,Jonas Fischer,Jilles Vreeken

http://arxiv.org/abs/2312.04311v1

Compressor summary: The proposed binary neural network architecture DIFFNAPS can extract differential patterns from high-dimensional data in a scalable and interpretable way, improving the understanding of cellular processes and potentially leading to novel treatments.


A Structural-Clustering Based Active Learning for Graph Neural Networks

Ricky Maulana Fajri,Yulong Pei,Lu Yin,Mykola Pechenizkiy

http://arxiv.org/abs/2312.04307v1

Compressor summary: The Structural-Clustering PageRank method for improved Active learning (SPA) is a simple and effective approach to select informative and central nodes from graph-structured data using community detection and PageRank scoring.
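
SPA's exact scoring is not spelled out in the summary, but the general recipe it describes (partition the graph into communities, then pick central nodes by PageRank) can be sketched with networkx; `select_informative_nodes` and its `budget` parameter are illustrative names, not from the paper:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def select_informative_nodes(G, budget):
    """Pick up to `budget` label candidates: detect communities, then
    take the highest-PageRank node from each community in turn."""
    communities = greedy_modularity_communities(G)
    pagerank = nx.pagerank(G)
    picks = []
    for community in communities:
        if len(picks) == budget:
            break
        picks.append(max(community, key=pagerank.get))
    return picks

# Demo on a classic benchmark graph.
demo = nx.karate_club_graph()
print(select_informative_nodes(demo, 2))
```

Selecting one node per community spreads the labeling budget across structurally distinct regions of the graph, which is the intuition behind combining community detection with a centrality score.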


nerblackbox: A High-level Library for Named Entity Recognition in Python

Felix Stollenwerk

http://arxiv.org/abs/2312.04306v1

Compressor summary: nerblackbox is a python library that simplifies using transformer-based models for named entity recognition, offering various options for training, evaluation, and inference.


Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Yuechen Zhang,Shengju Qian,Bohao Peng,Shu Liu,Jiaya Jia

http://arxiv.org/abs/2312.04302v1

Compressor summary: The study introduces Prompt Highlighter, a method to control text generation from multi-modal LLMs by highlighting specific prompt spans for focused and customized output.


Cross-codex Learning for Reliable Scribe Identification in Medieval Manuscripts

Julius Weißmann,Markus Seidl,Anya Dietrich,Martin Haltrich

http://arxiv.org/abs/2312.04296v1

Compressor summary: The paper shows how using cross-codex training data and neural networks can improve scribe identification from historic manuscripts, allowing for more accurate and efficient paleographic analysis.


Estimating Countries with Similar Maternal Mortality Rate using Cluster Analysis and Pairing Countries with Identical MMR

S. Nandini,Sanjjushri Varshini R

http://arxiv.org/abs/2312.04275v1

Compressor summary: The text discusses how machine learning can be used to analyze maternal mortality rates in different countries and identify similarities and differences in their factors affecting these rates.


Invariant Random Forest: Tree-Based Model Solution for OOD Generalization

Yufan Liao,Qi Wu,Xing Yan

http://arxiv.org/abs/2312.04273v1

Compressor summary: The paper proposes Invariant Decision Tree (IDT) and Invariant Random Forest (IRF), novel methods for out-of-distribution generalization in decision tree models, motivated by theory and validated by experiments.


Activity Grammars for Temporal Action Segmentation

Dayoung Gong,Joonseok Lee,Deunsol Jung,Suha Kwak,Minsu Cho

http://arxiv.org/abs/2312.04266v1

Compressor summary: The paper introduces an activity grammar to help neural networks predict actions from videos more accurately and understandably.


Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Zhixiang Wei,Lin Chen,Yi Jin,Xiaoxiao Ma,Tianle Liu,Pengyang Lin,Ben Wang,Huaian Chen,Jinjin Zheng

http://arxiv.org/abs/2312.04265v1

Compressor summary: The paper introduces Rein, a robust fine-tuning method that uses fewer trainable parameters to improve semantic segmentation with pre-trained vision models, achieving state-of-the-art results.


PsyChat: A Client-Centric Dialogue System for Mental Health Support

Huachuan Qiu,Anqi Li,Lizhi Ma,Zhenzhong Lan

http://arxiv.org/abs/2312.04262v1

Compressor summary: PsyChat is a client-centric dialogue system that provides psychological support through online chat by recognizing client behaviors and generating appropriate responses.


Extending Answer Set Programming with Rational Numbers

Francesco Pacenza,Jessica Zangari

http://arxiv.org/abs/2312.04249v1

Compressor summary: The paper proposes an extension to Answer Set Programming (ASP) that approximates non-integers with rational numbers, improving its ability to model real-world data and information while preserving declarativity and reproducibility.
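
The core idea, trading inexact reals for exact rationals, is easy to see outside ASP; a minimal Python illustration using the standard library's `fractions` module (not the paper's system):

```python
from fractions import Fraction

# Floats accumulate representation error...
print(0.1 + 0.2 == 0.3)                  # → False

# ...while rationals stay exact and reproducible.
a = Fraction(1, 10) + Fraction(2, 10)
print(a == Fraction(3, 10))              # → True

# An irrational value like sqrt(2) is approximated by a rational
# at a chosen precision, keeping all later arithmetic exact.
sqrt2 = Fraction(141421356, 100000000)
print(sqrt2 * sqrt2)                     # a rational close to 2
```

Because every operation on rationals is exact, repeated runs produce identical answers, which is the declarativity and reproducibility argument the summary refers to.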


TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Xuying Zhang,Bo-Wen Yin,Yuming Chen,Zheng Lin,Yunheng Li,Qibin Hou,Ming-Ming Cheng

http://arxiv.org/abs/2312.04248v1

Compressor summary: TeMO is a novel framework that uses Decoupled Graph Attention and Cross-Grained Contrast supervision to style multiple objects in 3D scenes.


Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images

Yiqun Zhang,Zhenyue Qin,Yang Liu,Dylan Campbell

http://arxiv.org/abs/2312.04236v1

Compressor summary: The authors present a method to correct anatomical errors in hand images generated by Stable Diffusion using a specialized dataset, detection, pose estimation, ControlNet, and InstructPix2Pix.


Graph Convolutions Enrich the Self-Attention in Transformers!

Jeongwhan Choi,Hyowon Wi,Jayoung Kim,Yehjin Shin,Kookjin Lee,Nathaniel Trask,Noseong Park

http://arxiv.org/abs/2312.04234v1

Compressor summary: The authors propose a new self-attention mechanism called graph-filter-based self-attention (GFSA) that improves Transformer performance across different tasks by addressing the oversmoothing problem.


Fine-tune vision foundation model for crack segmentation in civil infrastructures

Kang Ge,Chen Wang,Yutao Guo

http://arxiv.org/abs/2312.04233v1

Compressor summary: The authors propose CrackSAM, a large foundation model fine-tuned for crack segmentation using two efficient methods, and show its excellent performance on two unique datasets and challenging conditions.


Adventures of Trustworthy Vision-Language Models: A Survey

Mayank Vatsa,Anubhooti Jain,Richa Singh

http://arxiv.org/abs/2312.04231v1

Compressor summary: This paper examines vision-language transformers using BRI principles to improve their trustworthiness and accountability in various applications.


TLCE: Transfer-Learning Based Classifier Ensembles for Few-Shot Class-Incremental Learning

Shuangmei Wang,Yang Cao,Tieru Wu

http://arxiv.org/abs/2312.04225v1

Compressor summary: TLCE is a method that uses multiple pre-trained models and episodic training to recognize new classes without forgetting old ones or overfitting, achieving better results than existing few-shot class-incremental learning approaches.


Swap distance minimization in SOV languages. Cognitive and mathematical foundations

Ramon Ferrer-i-Cancho,Savithry Namboodiripad

http://arxiv.org/abs/2312.04219v1

Compressor summary: The paragraph discusses the principle of swap distance minimization in word order variations and its cognitive underpinning, and tests it on three flexible order SOV languages.


CODEX: A Cluster-Based Method for Explainable Reinforcement Learning

Timothy K. Mathes,Jessica Inman,Andrés Colón,Simon Khan

http://arxiv.org/abs/2312.04216v1

Compressor summary: The paper introduces CODEX, a method that uses semantic clustering to summarize RL agent behavior in state-action space, making it easier to explain and build trust in high-risk applications.


Constraint Model for the Satellite Image Mosaic Selection Problem

Manuel Combarro Simón,Pierre Talbot,Grégoire Danoy,Jedrzej Musial,Mohammed Alswaitti,Pascal Bouvry

http://arxiv.org/abs/2312.04210v1

Compressor summary: The paper presents a new problem of selecting satellite images to create mosaics that meet multiple criteria (area of interest, image requirements, and objectives), and proposes two models and a realistic dataset to address it.


Constrained Hierarchical Clustering via Graph Coarsening and Optimal Cuts

Eliabelle Mauduit,Andrea Simonetto

http://arxiv.org/abs/2312.04209v1

Compressor summary: The paragraph discusses a method for clustering words with both horizontal and vertical constraints using a two-step algorithm that combines soft constraints, graph coarsening, and optimal cut heights.


SAMBA: A Trainable Segmentation Web-App with Smart Labelling

Ronan Docherty,Isaac Squires,Antonis Vamvakeros,Samuel J. Cooper

http://arxiv.org/abs/2312.04197v1

Compressor summary: SAMBA is a web-based trainable segmentation tool for materials science images that uses SAM for label suggestions and a random forest classifier for robust segmentations.


Language Model Knowledge Distillation for Efficient Question Answering in Spanish

Adrián Bazaga,Pietro Liò,Gos Micklem

http://arxiv.org/abs/2312.04193v1

Compressor summary: The authors develop a smaller, efficient Spanish language model for question answering based on knowledge distillation from a larger model.


Joint-Individual Fusion Structure with Fusion Attention Module for Multi-Modal Skin Cancer Classification

Peng Tang,Xintong Yan,Yang Nan,Xiaobin Hu,Bjoern H Menze,Sebastian Krammer,Tobias Lasser

http://arxiv.org/abs/2312.04189v1

Compressor summary: The paper proposes a new fusion method combining dermatological images and patient metadata for skin cancer classification using a joint-individual fusion structure and a fusion attention module, which improves accuracy over existing methods.


AI and Jobs: Has the Inflection Point Arrived? Evidence from an Online Labor Platform

Dandan Qiao,Huaxia Rui,Qian Xiong

http://arxiv.org/abs/2312.04180v1

Compressor summary: The text discusses how artificial intelligence performance affects human workers' jobs in different occupations and proposes a framework to analyze the impact of AI on employment.


A novel feature selection framework for incomplete data

Cong Guo

http://arxiv.org/abs/2312.04171v1

Compressor summary: The paper proposes a new framework for selecting features on incomplete datasets that considers feature importance in the imputation process and uses an improved reliefF algorithm to learn the feature importance vector.


Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation

Jiawei Fan,Chao Li,Xiaolong Liu,Meina Song,Anbang Yao

http://arxiv.org/abs/2312.04168v1

Compressor summary: Af-DCD is a new contrastive learning method for semantic segmentation that improves efficiency and accuracy by using masked features and feature partitions without data augmentation or memory buffer.


Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

Xiaoyu Lin,Laurent Girin,Xavier Alameda-Pineda

http://arxiv.org/abs/2312.04167v1

Compressor summary: The paper introduces MixDVAE, a latent-variable generative model for multi-source dynamics estimation, and demonstrates its effectiveness on computer vision and audio processing tasks.


Text as Image: Learning Transferable Adapter for Multi-Label Classification

Xuelin Zhu,Jiuxin Cao,Jian Liu,Dongqi Tang,Furong Xu,Weijia Liu,Jiawei Ge,Bo Liu,Qingpei Guo,Tianyi Zhang

http://arxiv.org/abs/2312.04160v1

Compressor summary: The authors propose a method to improve vision-language pre-trained models for multi-label image classification by using an adapter network with random perturbation and large language models for text generation, enabling automated visual label recognition.


EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer

Fei Wang,Dan Guo,Kun Li,Meng Wang

http://arxiv.org/abs/2312.04152v1

Compressor summary: The paper proposes a novel dynamic filtering strategy for video motion magnification that separates texture and shape, eliminates noise, and preserves critical features using a global dynamic sparse cross-covariance attention mechanism and a multi-scale dual-path gating mechanism.


Diffusing Colors: Image Colorization with Text Guided Diffusion

Nir Zabari,Aharon Azulay,Alexey Gorkor,Tavi Halperin,Ohad Fried

http://arxiv.org/abs/2312.04145v1

Compressor summary: The paper proposes a new method for colorizing grayscale images using diffusion techniques and text prompts, improving both visual quality and user control over the process.


Towards 4D Human Video Stylization

Tiantian Wang,Xinxin Zuo,Fangzhou Mu,Jian Wang,Ming-Hsuan Yang

http://arxiv.org/abs/2312.04143v1

Compressor summary: The paper proposes a method for stylizing human videos in 4D (3D and time) by using NeRFs to represent both the person and their surroundings, allowing for animation across poses and viewpoints.


TimeDRL: Disentangled Representation Learning for Multivariate Time-Series

Ching Chang,Chiao-Tung Chan,Wei-Yao Wang,Wen-Chih Peng,Tien-Fu Chen

http://arxiv.org/abs/2312.04142v1

Compressor summary: TimeDRL is a novel framework that learns disentangled embeddings from multivariate time-series data using timestamp-predictive and instance-contrastive tasks, without relying on augmentation methods or transformation-invariance.


Polarimetric Light Transport Analysis for Specular Inter-reflection

Ryota Maeda,Shinsaku Hiura

http://arxiv.org/abs/2312.04140v1

Compressor summary: The paper introduces a novel polarimetric method to decompose specular inter-reflections of metal objects by analyzing the rotation direction of linear polarization.


Using a Large Language Model to generate a Design Structure Matrix

Edwin C. Y. Koh

http://arxiv.org/abs/2312.04134v1

Compressor summary: The paper proposes a workflow using a large language model to help create Design Structure Matrices for complex engineering systems, which could save time and resources compared to traditional manual methods.


Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak

Yanrui Du,Sendong Zhao,Ming Ma,Yuhan Chen,Bing Qin

http://arxiv.org/abs/2312.04127v1

Compressor summary: The paper introduces a new jailbreak attack method called RADIAL, which exploits the inherent response tendencies of large language models to generate harmful responses when given specific real-world instructions with embedded malicious instructions.


Forensic Iris Image Synthesis

Rasel Ahmed Bhuiyan,Adam Czajka

http://arxiv.org/abs/2312.04125v1

Compressor summary: The paper presents a new iris synthesis model using StyleGAN to generate realistic post-mortem iris images for data collection and training purposes in forensic identification.


Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Timothy Schaumlöffel,Arthur Aubret,Gemma Roig,Jochen Triesch

http://arxiv.org/abs/2312.04118v1

Compressor summary: The study proposes a computational model to investigate how caregivers' utterances during play sessions can enhance infants' ability to recognize and categorize objects visually.


Instance Tracking in 3D Scenes from Egocentric Videos

Yunhan Zhao,Haoyu Ma,Shu Kong,Charless Fowlkes

http://arxiv.org/abs/2312.04117v1

Compressor summary: The authors introduce a new dataset and evaluation protocol for instance tracking in real-world 3D scenes from egocentric videos, and present a simple method that outperforms SOT-based approaches.


Multi-strategy Collaborative Optimized YOLOv5s and its Application in Distance Estimation

Zijian Shen,Zhenping Mu,Xiangxiang Li

http://arxiv.org/abs/2312.04113v1

Compressor summary: The text describes a new neural network model for vehicle target detection and distance estimation in automobiles, which improves safety warnings, with its design choices validated through nonparametric testing.


Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification

Henan Sun,Xunkai Li,Zhengyu Wu,Daohan Su,Rong-Hua Li,Guoren Wang

http://arxiv.org/abs/2312.04111v1

Compressor summary: AMUD introduces a new GNN model that adapts to homophily and heterophily in directed graphs, improving node representations and graph learning efficiency.


Identity-Obscured Neural Radiance Fields: Privacy-Preserving 3D Facial Reconstruction

Jiayi Kong,Baixin Xu,Xurui Song,Chen Qian,Jun Luo,Ying He

http://arxiv.org/abs/2312.04106v1

Compressor summary: The proposed method reconstructs 3D head geometry with NeRF using identity-obscured inputs to preserve facial privacy.


Enhancing the Rationale-Input Alignment for Self-explaining Rationalization

Wei Liu,Haozhao Wang,Jun Wang,Zhiying Deng,YuanKai Zhang,Cheng Wang,Ruixuan Li

http://arxiv.org/abs/2312.04103v1

Compressor summary: The paper proposes DAR, a method that improves explanation quality in deep learning models by aligning the selected rationale with the original input to avoid the rationale shift problem.


Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection

Tuan Hoang,Santu Rana,Sunil Gupta,Svetha Venkatesh

http://arxiv.org/abs/2312.04095v1

Compressor summary: Projected-Gradient Unlearning (PGU) is a method for removing specific data samples from a machine learning model without affecting its performance on the remaining dataset, using an efficient algorithm that can handle any model and dataset size.


Open-Vocabulary Segmentation with Semantic-Assisted Calibration

Yong Liu,Sule Bai,Guanbin Li,Yitong Wang,Yansong Tang

http://arxiv.org/abs/2312.04089v1

Compressor summary: The paper proposes SCAN, a method for open-vocabulary segmentation that uses CLIP's generalized contextual prior to improve alignment of visual content with unbounded text and introduces SG-IoU, a new metric to address semantic duplication issues.


VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models

Zongjie Li,Chaozheng Wang,Chaowei Liu,Pingchuan Ma,Daoyuan Wu,Shuai Wang,Cuiyun Gao

http://arxiv.org/abs/2312.04087v1

Compressor summary: The study analyzes the performance of Large Multimodal Models (LMMs) using various visual referring prompting strategies, introducing a new benchmark dataset called VRPTEST and finding that the choice of prompt strategy significantly affects accuracy.


MTVG : Multi-text Video Generation with Text-to-Video Models

Gyeongrok Oh,Jaehwan Jeong,Sieun Kim,Wonmin Byeon,Jinkyu Kim,Sungwoong Kim,Hyeokmin Kwon,Sangpil Kim

http://arxiv.org/abs/2312.04086v1

Compressor summary: The authors propose a novel method for generating videos from multiple texts with diverse events, using a pre-trained diffusion-based model and several techniques to ensure visual consistency and coherence.


On the adaptation of in-context learners for system identification

Dario Piga,Filippo Pura,Marco Forgione

http://arxiv.org/abs/2312.04083v1

Compressor summary: The paper explores how adapting meta-models can improve predictive performance in different scenarios of system identification, enhancing robustness and versatility.


Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Zhaoheng Zheng,Jingmin Wei,Xuefeng Hu,Haidong Zhu,Ram Nevatia

http://arxiv.org/abs/2312.04076v1

Compressor summary: The paper proposes LLaMP, a method to integrate Large Language Models into pre-trained Vision-Language models for low-shot image classification by generating adaptive prompts for the CLIP text encoder.


A Transformer Model for Symbolic Regression towards Scientific Discovery

Florian Lalande,Yoshitomo Matsubara,Naoya Chiba,Tatsunori Taniai,Ryo Igarashi,Yoshitaka Ushiku

http://arxiv.org/abs/2312.04070v1

Compressor summary: The paper introduces a new Transformer model for Symbolic Regression that can find mathematical expressions for datasets without interpretation issues, but with more computation and flexibility needed to avoid overfitting, achieving state-of-the-art results on SRSD datasets.


MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion

Dehua Peng,Zhipeng Gui,Huayi Wu

http://arxiv.org/abs/2312.04067v1

Compressor summary: The paper proposes a new graph clustering method, MeanCut, that uses path-based similarity to handle non-spherical data and improve cluster associations, while reducing computational complexity and enhancing robustness.


Combining inherent knowledge of vision-language models with unsupervised domain adaptation through self-knowledge distillation

Thomas Westfechtel,Dexuan Zhang,Tatsuya Harada

http://arxiv.org/abs/2312.04066v1

Compressor summary: The paper proposes a method that combines unsupervised domain adaptation with vision-language models to improve zero-shot prediction accuracy on image classification tasks, using data from source and target domains and adjusting class probabilities.


A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion

Dehua Peng,Zhipeng Gui,Huayi Wu

http://arxiv.org/abs/2312.04065v1

Compressor summary: The paper proposes LoDD, a method for detecting boundary points in machine learning tasks, which uses KNN and eigenvalues of the covariance matrix to measure centrality and performs well on synthetic and real datasets.
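
A rough sketch of the underlying intuition, not the authors' exact formulation: interior points see neighbors in all directions (near-isotropic covariance of neighbor directions), while boundary points see them on one side, so the eigenvalue ratio of that covariance flags boundaries:

```python
import numpy as np

def boundary_scores(points, k=10):
    """Score each point by the dispersion of directions to its k nearest
    neighbors; low scores indicate one-sided (boundary-like) neighborhoods."""
    n = len(points)
    scores = np.empty(n)
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(dists)[1:k + 1]]   # skip the point itself
        dirs = nbrs - p
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        eigvals = np.linalg.eigvalsh(np.cov(dirs.T))
        # ratio of smallest to largest eigenvalue:
        # ~1 => neighbors all around; near 0 => neighbors on one side
        scores[i] = eigvals[0] / eigvals[-1]
    return scores

# Demo: on a 5x5 grid, the corner point scores lower (more boundary-like)
# than an interior point.
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
scores = boundary_scores(grid, k=8)
print(scores[0] < scores[12])  # corner (0,0) vs. center (2,2) → True
```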


An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything

Israt Zarin Era,Imtiaz Ahmed,Zhichao Liu,Srinjoy Das

http://arxiv.org/abs/2312.04063v1

Compressor summary: The paragraph describes a framework for real-time image segmentation in manufacturing using a Foundation model with unsupervised prompt generation, which could improve product quality and enable Industry 4.0.


Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching

Junsheng Zhou,Baorui Ma,Wenyuan Zhang,Yi Fang,Yu-Shen Liu,Zhizhong Han

http://arxiv.org/abs/2312.04060v1

Compressor summary: The authors propose a novel method to register 2D images and 3D point clouds using a structured cross-modality latent space learned by a triplet network and a differentiable probabilistic PnP solver, achieving state-of-the-art results on KITTI and nuScenes datasets.


Comparing Large Language Model AI and Human-Generated Coaching Messages for Behavioral Weight Loss

Zhuoran Huang,Michael P. Berry,Christina Chwyl,Gary Hsieh,Jing Wei,Evan M. Forman

http://arxiv.org/abs/2312.04059v1

Compressor summary: LLM AI chatbots like ChatGPT can generate personalized and novel weight-loss coaching messages that are as helpful as human-written ones, but need improvements in authenticity and data focus.


Jointly spatial-temporal representation learning for individual trajectories

Fei Huang,Jianrong Lv,Yang Yue

http://arxiv.org/abs/2312.04055v1

Compressor summary: The paper proposes a new method (ST-GraphRL) to represent human trajectories in a way that captures their spatial and temporal dependencies, which improves the performance of geospatial foundation models.


Multimodal Misinformation Detection in a South African Social Media Environment

Amica De Jager,Vukosi Marivate,Abiodun Modupe

http://arxiv.org/abs/2312.04052v1

Compressor summary: The paper presents a multimodal misinformation detection model for South African social media that uses textual and visual information, and shows its improved performance compared to unimodal models.


Residual Graph Convolutional Network for Bird's-Eye-View Semantic Segmentation

Qiuxiao Chen,Xiaojun Qi

http://arxiv.org/abs/2312.04044v1

Compressor summary: The paper proposes a Residual Graph Convolutional (RGC) module for Bird's-Eye-View semantic segmentation that improves global information and region-level semantic relationships using graph space projection and data augmentation.


Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes

Hmrishav Bandyopadhyay,Subhadeep Koley,Ayan Das,Aneeshan Sain,Pinaki Nath Chowdhury,Tao Xiang,Ayan Kumar Bhunia,Yi-Zhe Song

http://arxiv.org/abs/2312.04043v1

Compressor summary: The paper presents a new framework for generating 3D shapes from sketches that simplifies the process, allows editing, and works efficiently.


Reconstruction of dynamical systems from data without time labels

Zhijun Zeng,Pipi Hu,Chenglong Bao,Yi Zhu,Zuoqiang Shi

http://arxiv.org/abs/2312.04038v1

Compressor summary: The paper proposes a method to reconstruct dynamical systems from data without time labels using sliced Wasserstein distance to minimize distribution loss.
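
The sliced Wasserstein distance mentioned here is a standard construction: project both point sets onto random directions and average the resulting 1D Wasserstein distances. A minimal NumPy sketch (function name and sampling scheme are illustrative, not taken from the paper):

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=100, seed=0):
    """Approximate the sliced Wasserstein distance between two point
    clouds x, y of shape (n, d) with equal sample counts."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        # 1D Wasserstein-1 distance between empirical distributions with
        # equal sample counts = mean absolute difference of sorted projections.
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py))
    return total / n_projections

pts = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
print(sliced_wasserstein(pts, pts))  # → 0.0 (identical clouds)
```

Because the loss compares distributions of states rather than time-indexed samples, it can be minimized even when the data carry no time labels, which is the role it plays in the summary above.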


DiffusionPhase: Motion Diffusion in Frequency Domain

Weilin Wan,Yiming Huang,Shutong Wu,Taku Komura,Wenping Wang,Dinesh Jayaraman,Lingjie Liu

http://arxiv.org/abs/2312.04036v1

Compressor summary: The study presents a method for generating diverse and smooth human motion sequences from text descriptions using a network encoder and a conditional diffusion model in the frequency domain.


RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training

Jaehyung Kim,Yuning Mao,Rui Hou,Hanchao Yu,Davis Liang,Pascale Fung,Qifan Wang,Fuli Feng,Lifu Huang,Madian Khabsa

http://arxiv.org/abs/2312.04032v1

Compressor summary: The paper proposes RoAST, a technique that enhances the multi-perspective robustness of pre-trained language models by incorporating adversarial perturbation and selective training during fine-tuning.


Modeling Boundedly Rational Agents with Latent Inference Budgets

Athul Paul Jacob,Abhishek Gupta,Jacob Andreas

http://arxiv.org/abs/2312.04030v1

Compressor summary: The latent inference budget model (L-IBM) is a new approach that explicitly simulates agents' computational constraints in models of bounded rationality, and shows promising results in various tasks involving suboptimal decision-making.


Improved Face Representation via Joint Label Classification and Supervised Contrastive Clustering

Zhenduo Zhang

http://arxiv.org/abs/2312.04029v1

Compressor summary: The paper introduces a new method for improving face recognition by using cluster knowledge from face clustering tasks in two ways, extending ArcFace with a cluster-guided angular margin and aligning cluster centers with class centers in the classifier.


ImFace++: A Sophisticated Nonlinear 3D Morphable Face Model with Implicit Neural Representations

Mingwu Zheng,Haiyu Zhang,Hongyu Yang,Liming Chen,Di Huang

http://arxiv.org/abs/2312.04028v1

Compressor summary: The paper introduces ImFace++, a novel 3D morphable face model that learns continuous neural representations with disentangled deformation fields, refinement displacement field, and Neural Blend-Field to capture complex facial shapes and expressions for various computer vision and graphics applications.


The sample complexity of multi-distribution learning

Binghui Peng

http://arxiv.org/abs/2312.04027v1

Compressor summary: The paper provides an algorithm for multi-distribution learning with sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$ and resolves a COLT 2023 open problem.


k* Distribution: Evaluating the Latent Space of Deep Neural Networks using Local Neighborhood Analysis

Shashank Kotyan,Ueda Tatsuya,Danilo Vasconcellos Vargas

http://arxiv.org/abs/2312.04024v1

Compressor summary: The k* Distribution method analyzes the structure of sample distributions within specific classes in the learned latent space of neural networks, revealing different distribution types and enabling a deeper understanding of how these networks process various classes.


A Study on the Calibration of In-context Learning

Hanlin Zhang,Yi-Fan Zhang,Yaodong Yu,Dhruv Madeka,Dean Foster,Eric Xing,Hima Lakkaraju,Sham Kakade

http://arxiv.org/abs/2312.04021v1

Compressor summary: The study examines the trade-offs between performance and calibration of large language models in in-context learning tasks and suggests that current recalibration techniques may not be sufficient for ensuring reliability.


PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation

Ardian Umam,Cheng-Kun Yang,Min-Hung Chen,Jen-Hui Chuang,Yen-Yu Lin

http://arxiv.org/abs/2312.04016v1

Compressor summary: The paper introduces PartDistill, a framework that transfers 2D knowledge from vision-language models to improve 3D shape part segmentation using cross-modal distillation.


Natural-language-driven Simulation Benchmark and Copilot for Efficient Production of Object Interactions in Virtual Road Scenes

Kairui Yang,Zihao Guo,Gengjie Lin,Haotian Dong,Die Zuo,Jibin Peng,Zhao Huang,Zhecheng Xu,Fupeng Li,Ziyun Bai,Di Lin

http://arxiv.org/abs/2312.04008v1

Compressor summary: The authors propose a natural-language-driven simulation for creating realistic object interactions in virtual driving scenes and present a new method called SimCopilot to evaluate their approach using the Language-to-Interaction dataset.


KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis

Youngwan Lee,Kwanyong Park,Yoorhim Cho,Yong-Ju Lee,Sung Ju Hwang

http://arxiv.org/abs/2312.04005v1

Compressor summary: The paper proposes an efficient text-to-image model by distilling knowledge from the larger and faster Stable Diffusion XL model, addressing its high computation cost and size requirements.


LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures

Vimal Thilak,Chen Huang,Omid Saremi,Laurent Dinh,Hanlin Goh,Preetum Nakkiran,Joshua M. Susskind,Etai Littwin

http://arxiv.org/abs/2312.04000v1

Compressor summary: LiDAR is a metric that measures the quality of representations in joint embedding architectures by quantifying the rank of the LDA matrix associated with a surrogate self-supervised learning task.


Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification

Navid Mohammadi Foumani,Chang Wei Tan,Geoffrey I. Webb,Mahsa Salehi

http://arxiv.org/abs/2312.03998v1

Compressor summary: The authors propose a new self-supervised method called Series2Vec, which predicts the similarity between two time series in both temporal and spectral domains, and shows its effectiveness on various real-world datasets.
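
A toy illustration of what "similarity in both temporal and spectral domains" can mean: cosine similarity on the raw values and on FFT magnitudes, averaged. This is a hand-rolled stand-in, not the paper's learned similarity:

```python
import numpy as np

def dual_domain_similarity(x, y):
    """Average of cosine similarity in the time domain and in the
    spectral domain (FFT magnitudes) for two equal-length series."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    time_sim = cosine(x, y)
    spec_sim = cosine(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y)))
    return 0.5 * (time_sim + spec_sim)

t = np.linspace(0.0, 1.0, 128)
x = np.sin(2 * np.pi * 4 * t)
print(dual_domain_similarity(x, x))  # close to 1.0 for identical series
```

Note that the spectral term ignores phase (a sign-flipped series has identical FFT magnitudes), so the two domains capture complementary notions of closeness.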


Stable diffusion for Data Augmentation in COCO and Weed Datasets

Boyang Deng,Yuzhen Lu

http://arxiv.org/abs/2312.03996v1

Compressor summary: The paragraph discusses using stable diffusion generative models to improve object detection and classification tasks with synthetic images from small datasets, and evaluates their performance on various categories from the COCO dataset and weed species in Michigan.


Style Transfer to Calvin and Hobbes comics using Stable Diffusion

Sloke Shrestha,Sundar Sripada V. S.,Asvin Venkataramanan

http://arxiv.org/abs/2312.03993v1

Compressor summary: The report describes using stable-diffusion-v1.5 with LoRA to perform style transfer on Calvin and Hobbes comics, achieving good visual results.


MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator

Xiao-Yin Liu,Xiao-Hu Zhou,Guo-Tao Li,Hao Li,Mei-Jiang Gui,Tian-Yu Xiang,De-Xing Huang,Zeng-Guang Hou

http://arxiv.org/abs/2312.03991v1

Compressor summary: The paper proposes a new model-based offline reinforcement learning algorithm (MICRO) that balances performance and robustness by using a conservative Bellman operator and reduces computation cost compared to previous methods.


Rapid detection of rare events from in situ X-ray diffraction data using machine learning

Weijian Zheng,Jun-Sang Park,Peter Kenesei,Ahsan Ali,Zhengchun Liu,Ian T. Foster,Nicholas Schwarz,Rajkumar Kettimuthu,Antonino Miceli,Hemant Sharma

http://arxiv.org/abs/2312.03989v1

Compressor summary: The paragraph describes a new automated technique for quickly detecting plasticity in metallic materials using high-energy X-ray microscopy, which is faster and works with sparser data than traditional methods.


Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration

Meihao Fan,Xiaoyue Han,Ju Fan,Chengliang Chai,Nan Tang,Guoliang Li,Xiaoyong Du

http://arxiv.org/abs/2312.03987v1

Compressor summary: The paper proposes BATCHER, a cost-effective batch prompting approach for entity resolution using large language models without fine-tuning or manual prompting.


Node-aware Bi-smoothing: Certified Robustness against Graph Injection Attacks

Yuni Lai,Yulin Zhu,Bailin Pan,Kai Zhou

http://arxiv.org/abs/2312.03979v1

Compressor summary: The text introduces a new framework for defending DGL models against graph injection attacks, which is model-agnostic and provides theoretical and empirical evidence of its effectiveness.


Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models

Shibin Wu,Bang Yang,Zhiyu Ye,Haoqian Wang,Hairong Zheng,Tong Zhang

http://arxiv.org/abs/2312.03970v1

Compressor summary: The study improves medical report generation using a customized vision-language model that integrates adapter tuning and medical knowledge enhancement, achieving better accuracy and coherence than existing methods.