arxiv compressed, 2023-11-28

This page contains one-sentence summaries of cs.AI/ML/CV papers announced on 2023-11-28 generated by the compressor, my personal LLM-based project.


Material Palette: Extraction of Materials from a Single Image

Ivan Lopes,Fabio Pizzati,Raoul de Charette

http://arxiv.org/abs/2311.17060v1

Compressor summary: The paper presents a method to create physically-based rendering materials from real images using a diffusion model, texture generation, SVBRDF decomposition, and unsupervised domain adaptation.


HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

Xian Liu,Xiaohang Zhan,Jiaxiang Tang,Ying Shan,Gang Zeng,Dahua Lin,Xihui Liu,Ziwei Liu

http://arxiv.org/abs/2311.17061v1

Compressor summary: HumanGaussian is a new method to generate realistic 3D human models from text prompts using adaptive Gaussian Splatting and Structure-Aware Score Distillation Sampling.


Panoptic Video Scene Graph Generation

Jingkang Yang,Wenxuan Peng,Xiangtai Li,Zujin Guo,Liangyu Chen,Bo Li,Zheng Ma,Kaiyang Zhou,Wayne Zhang,Chen Change Loy,Ziwei Liu

http://arxiv.org/abs/2311.17058v1

Compressor summary: PVSG is a new problem that aims to generate pixel-level segmented scene graphs from videos, overcoming the limitations of VidSGG, which uses bounding boxes.


ReMoS: Reactive 3D Motion Synthesis for Two-Person Interactions

Anindita Ghosh,Rishabh Dabral,Vladislav Golyanik,Christian Theobalt,Philipp Slusallek

http://arxiv.org/abs/2311.17057v1

Compressor summary: ReMoS is a model that generates realistic reactive motions for two people interacting, such as dancing or fighting, given the motion of one person, and can handle complex scenarios with full-body and hand interactions.


Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

Zhaoying Pan,Daniel Geng,Andrew Owens

http://arxiv.org/abs/2311.17056v1

Compressor summary: The paper presents a self-supervised method to magnify subtle motions in video using a loss function that estimates the optical flow and penalizes deviations from the desired magnification factor, which can be adapted at test time and applied to selected objects without needing synthetic datasets.


Rethinking Directional Integration in Neural Radiance Fields

Congyue Deng,Jiawei Yang,Leonidas Guibas,Yue Wang

http://arxiv.org/abs/2311.16504v1

Compressor summary: The paper proposes a simple modification to the NeRF rendering equation to improve view-dependent effects and achieve better rendering quality.


No Representation Rules Them All in Category Discovery

Sagar Vaze,Andrea Vedaldi,Andrew Zisserman

http://arxiv.org/abs/2311.17055v1

Compressor summary: The paper introduces Clevr-4, a synthetic dataset for Generalized Category Discovery (GCD) that requires models to extrapolate the taxonomy from labels, and presents a method that outperforms existing approaches on it.


Surf-D: High-Quality Surface Generation for Arbitrary Topologies using Diffusion Models

Zhengming Yu,Zhiyang Dou,Xiaoxiao Long,Cheng Lin,Zekun Li,Yuan Liu,Norman Müller,Taku Komura,Marc Habermann,Christian Theobalt,Xin Li,Wenping Wang

http://arxiv.org/abs/2311.17050v1

Compressor summary: The paper introduces Surf-D, a method for generating high-quality 3D shapes using Diffusion models and Unsigned Distance Field representation, which handles arbitrary topologies and performs well in various shape generation tasks.


MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu,Hadi Pouransari,Fartash Faghri,Raviteja Vemulapalli,Oncel Tuzel

http://arxiv.org/abs/2311.17049v1

Compressor summary: The paper introduces MobileCLIP, a family of efficient image-text models optimized for runtime performance and a novel multi-modal reinforced training approach that improves accuracy and learning efficiency.


Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han,Fangrui Zhu,Qianru Lao,Huaizu Jiang

http://arxiv.org/abs/2311.17048v1

Compressor summary: Zero-shot referring expression comprehension involves locating objects in an image based on textual prompts, and the authors propose a method using foundation models, triplets, and fine-tuning VLA models for improved performance.


LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Yanwei Li,Chengyao Wang,Jiaya Jia

http://arxiv.org/abs/2311.17043v1

Compressor summary: The LLaMA-VID method generates tokens for VLMs to process long videos by using context and content tokens, reducing computational burdens and improving performance on video and image benchmarks.


Adversarial Diffusion Distillation

Axel Sauer,Dominik Lorenz,Andreas Blattmann,Robin Rombach

http://arxiv.org/abs/2311.17042v1

Compressor summary: ADD trains large-scale image diffusion models to generate high-quality images in just 1-4 steps using score distillation and adversarial losses.


Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

Keunwoo Peter Yu,Zheyuan Zhang,Fengyuan Hu,Joyce Chai

http://arxiv.org/abs/2311.17041v1

Compressor summary: The paper proposes a new method to train vision-language models for egocentric videos using in-context learning with minimal data, achieving better performance and adaptability than existing methods.


Scalable Extraction of Training Data from (Production) Language Models

Milad Nasr,Nicholas Carlini,Jonathan Hayase,Matthew Jagielski,A. Feder Cooper,Daphne Ippolito,Christopher A. Choquette-Choo,Eric Wallace,Florian Tramèr,Katherine Lee

http://arxiv.org/abs/2311.17035v1

Compressor summary: The paper demonstrates how adversaries can efficiently extract training data from various machine learning models, including closed models like ChatGPT, by developing a new attack technique that exploits model misbehavior.


Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

Junyi Zhang,Charles Herrmann,Junhwa Hur,Eric Chen,Varun Jampani,Deqing Sun,Ming-Hsuan Yang

http://arxiv.org/abs/2311.17034v1

Compressor summary: The paper proposes geometry-aware solutions to improve semantic correspondence in vision models, creating a new benchmark dataset and achieving state-of-the-art results.


Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Aleksandar Makelov,Georg Lange,Neel Nanda

http://arxiv.org/abs/2311.17030v1

Compressor summary: The paper explores how subspace interventions in models can lead to misleading interpretability, but also provides examples of successful cases and suggests more evidence is needed for faithfulness.


When the Few Outweigh the Many: Illicit Content Recognition with Few-Shot Learning

G. Cascavilla,G. Catolino,M. Conti,D. Mellios,D. A. Tamburri

http://arxiv.org/abs/2311.17026v1

Compressor summary: The paper explores using Siamese neural networks for recognizing illegal activities on the Dark Web from images, showing promising results with 90.9% accuracy on a small dataset.


Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

Niladri Shekhar Dutt,Sanjeev Muralikrishnan,Niloy J. Mitra

http://arxiv.org/abs/2311.17024v1

Compressor summary: Diff3F is a feature descriptor for untextured shapes that uses conditional image synthesis from depth and normal maps to create robust, semantic features for correspondence across shape families.


Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Danah Yatim,Rafail Fridman,Omer Bar Tal,Yoni Kasten,Tali Dekel

http://arxiv.org/abs/2311.17009v1

Compressor summary: The paper proposes a text-driven motion transfer method that can synthesize videos with different objects and motions, using a pre-trained model and a new space-time feature loss.


Computational Hypergraph Discovery, a Gaussian Process framework for connecting the dots

Théo Bourdais,Pau Batlle,Xianjin Yang,Ricardo Baptista,Nicolas Rouquette,Houman Owhadi

http://arxiv.org/abs/2311.17007v1

Compressor summary: The paper introduces a GP framework for discovering and completing computational hypergraphs from partial observations, using a kernel generalization of Row Echelon Form reduction.


An Investigation of Time Reversal Symmetry in Reinforcement Learning

Brett Barkley,Amy Zhang,David Fridovich-Keil

http://arxiv.org/abs/2311.17008v1

Compressor summary: The paper explores using time reversal symmetry in Markov decision processes to increase sample efficiency in reinforcement learning, but notes that it may not work well in all environments.
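The data-augmentation flavour of this idea can be sketched in a few lines; note that real environments usually also require transforming the action under reversal (e.g., negating a velocity command), which this illustrative toy omits:

```python
def augment_with_reversed(transitions):
    # Double the replay data by adding the time-reversed transition
    # (s', a, s) for every (s, a, s'); only sound when the dynamics
    # are (approximately) reversible under the same action.
    return transitions + [(s2, a, s1) for (s1, a, s2) in transitions]

batch = [((0,), "right", (1,)), ((1,), "right", (2,))]
augmented = augment_with_reversed(batch)
```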


On the Impact of Sampling on Deep Sequential State Estimation

Helena Calatrava,Ricardo Augusto Borsoi,Tales Imbiriba,Pau Closas

http://arxiv.org/abs/2311.17006v1

Compressor summary: The paper proposes IW-DKF, which uses importance sampling to improve state inference and parameter learning in a deep Kalman filter for sequential models, showing better generative modeling performance and state estimation accuracy.
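The importance-sampling idea underlying IW-DKF can be illustrated in isolation. This toy self-normalized estimator (not the paper's model) reweights samples from a proposal distribution q to approximate an expectation under the target p:

```python
import numpy as np

def importance_estimate(f, p_pdf, q_pdf, q_sample, n=100_000, seed=0):
    # Self-normalized importance sampling: draw from the proposal q,
    # reweight by w = p(x) / q(x), and average f under those weights.
    rng = np.random.default_rng(seed)
    x = q_sample(rng, n)
    w = p_pdf(x) / q_pdf(x)
    return float((w * f(x)).sum() / w.sum())

# Estimate E[x^2] = 1 under p = N(0, 1) using a wider proposal q = N(0, 2).
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-x**2 / 8) / (2 * np.sqrt(2 * np.pi))
est = importance_estimate(lambda x: x**2, p_pdf, q_pdf,
                          lambda rng, n: rng.normal(0.0, 2.0, n))
```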


MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Kunchang Li,Yali Wang,Yinan He,Yizhuo Li,Yi Wang,Yi Liu,Zun Wang,Jilan Xu,Guo Chen,Ping Luo,Limin Wang,Yu Qiao

http://arxiv.org/abs/2311.17005v1

Compressor summary: MVBench is a benchmark for evaluating multi-modal language models' comprehension of dynamic videos, covering 20 challenging tasks that require temporal understanding.


Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Yutong Feng,Biao Gong,Di Chen,Yujun Shen,Yu Liu,Jingren Zhou

http://arxiv.org/abs/2311.17002v1

Compressor summary: The authors propose Ranni, a system that improves text-to-image diffusion models by using a semantic panel as a middleware to better follow complex instructions and enable more convenient interaction for users.


Goal-conditioned Offline Planning from Curious Exploration

Marco Bagatella,Georg Martius

http://arxiv.org/abs/2311.16996v1

Compressor summary: The text discusses using exploration techniques in deep reinforcement learning to extract goal-conditioned behavior without additional environment interaction, and proposes a method to combine model-based planning with graph-based value aggregation to improve zero-shot goal-reaching performance.


ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?

Hailin Chen,Fangkai Jiao,Xingxuan Li,Chengwei Qin,Mathieu Ravaut,Ruochen Zhao,Caiming Xiong,Shafiq Joty

http://arxiv.org/abs/2311.16989v1

Compressor summary: ChatGPT's release in 2022 sparked a surge in interest and development of large language models (LLMs), leading to rapid progress in both open-source and closed-source LLMs, with significant implications for research and business.


Assessing the influence of attractor-verb distance on grammatical agreement in humans and language models

Christos-Nikolaos Zacharopoulos,Théo Desbordes,Mathias Sablé-Meyer

http://arxiv.org/abs/2311.16978v1

Compressor summary: The paragraph discusses how the distance between an attractor noun and a verb affects grammatical judgments and reaction times, with neural networks performing differently from humans.


COLE: A Hierarchical Generation Framework for Graphic Design

Peidong Jia,Chenxuan Li,Zeyu Liu,Yichao Shen,Xingru Chen,Yuhui Yuan,Yinglin Zheng,Dong Chen,Ji Li,Xiaodong Xie,Shanghang Zhang,Baining Guo

http://arxiv.org/abs/2311.16974v1

Compressor summary: COLE is a hierarchical framework that uses specialized models to generate and edit high-quality graphic designs based on user input, enhancing reliability and streamlining the process.


Natural Language Processing Through Transfer Learning: A Case Study on Sentiment Analysis

Aman Yadav,Abhishek Vichare

http://arxiv.org/abs/2311.16965v1

Compressor summary: The paper shows that using pre-trained BERT models for sentiment analysis on IMDb movie reviews improves accuracy, but cautions against overfitting or lack of generalization without further analysis.


HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

Jingbo Zhang,Xiaoyu Li,Qi Zhang,Yanpei Cao,Ying Shan,Jing Liao

http://arxiv.org/abs/2311.16961v1

Compressor summary: The paper proposes a method called HumanRef for creating realistic 3D human models from one image that preserves texture details and maintains consistency in different views using reference-guided score distillation sampling and region-aware attention.


UC-NeRF: Neural Radiance Field for Under-Calibrated multi-view cameras in autonomous driving

Kai Cheng,Xiaoxiao Long,Wei Yin,Jin Wang,Zhiqiang Wu,Yuexin Ma,Kaixuan Wang,Xiaozhi Chen,Xuejin Chen

http://arxiv.org/abs/2311.16945v1

Compressor summary: The paper introduces UC-NeRF, a method to improve NeRF's performance in under-calibrated multi-camera systems by addressing color inconsistency and pose calibration issues through layer-based correction, virtual warping, and spatiotemporal constraint refinement.


Image segmentation with traveling waves in an exactly solvable recurrent neural network

Luisa H. B. Liboni,Roberto C. Budzinski,Alexandra N. Busch,Sindy Löwe,Thomas A. Keller,Max Welling,Lyle E. Muller

http://arxiv.org/abs/2311.16943v1

Compressor summary: The text describes a recurrent neural network that uses complex numbers to perform image segmentation by dividing an image into groups based on structural characteristics, with a simple algorithm that generalizes across different input types.


Debiasing Multimodal Models via Causal Information Minimization

Vaidehi Patil,Adyasha Maharana,Mohit Bansal

http://arxiv.org/abs/2311.16941v1

Compressor summary: The paper proposes a novel debiasing method for multimodal models using causally-motivated information minimization to learn confounder representations and improve OOD performance without sacrificing in-distribution performance.


The Sky's the Limit: Re-lightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

James A. D. Gardner,Evgenii Kashin,Bernhard Egger,William A. P. Smith

http://arxiv.org/abs/2311.16937v1

Compressor summary: The authors propose a method to infer albedo, geometry, illumination, and sky visibility from unconstrained images using neural networks, achieving state-of-the-art results on a benchmark dataset.


SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

Yuwei Guo,Ceyuan Yang,Anyi Rao,Maneesh Agrawala,Dahua Lin,Bo Dai

http://arxiv.org/abs/2311.16933v1

Compressor summary: SparseCtrl is a method for controlling video generation with minimal input signals, such as sketches or depth maps, improving flexibility and practicality for various applications.


LLaFS: When Large-Language Models Meet Few-Shot Segmentation

Lanyun Zhu,Tianrun Chen,Deyi Ji,Jieping Ye,Jun Liu

http://arxiv.org/abs/2311.16926v1

Compressor summary: The paper introduces LLaFS, a method that uses large language models to improve few-shot segmentation by incorporating prior knowledge and providing multi-modal guidance.


Super-Resolution through StyleGAN Regularized Latent Search: A Realism-Fidelity Trade-off

Marzieh Gheisari,Auguste Genovesio

http://arxiv.org/abs/2311.16923v1

Compressor summary: The paper proposes a new method to improve super-resolution by constraining the search in the latent space of StyleGAN and expanding the image prior around the optimal code, achieving realistic and high-quality results.


Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng,Hang Zhang,Guanzheng Chen,Xin Li,Shijian Lu,Chunyan Miao,Lidong Bing

http://arxiv.org/abs/2311.16922v1

Compressor summary: The Visual Contrastive Decoding method reduces object hallucinations in large vision-language models by contrasting output distributions from original and distorted visual inputs without additional training or external tools.
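The contrastive arithmetic behind this family of decoding methods can be sketched as follows; the exact weighting the paper uses may differ, so treat `alpha` and the formula as illustrative:

```python
import numpy as np

def contrastive_decode(logits_clean, logits_distorted, alpha=1.0):
    # Amplify what the clean visual input supports relative to the
    # distorted one, then renormalize with a softmax: tokens the
    # distorted input also pushes hard (likely hallucinations) lose
    # their advantage.
    z = (1 + alpha) * logits_clean - alpha * logits_distorted
    e = np.exp(z - z.max())
    return e / e.sum()

# A token pushed equally hard by the distorted input loses its edge.
p = contrastive_decode(np.array([2.0, 1.0]), np.array([2.0, 0.0]))
```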


RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D

Lingteng Qiu,Guanying Chen,Xiaodong Gu,Qi Zuo,Mutian Xu,Yushuang Wu,Weihao Yuan,Zilong Dong,Liefeng Bo,Xiaoguang Han

http://arxiv.org/abs/2311.16918v1

Compressor summary: The paper proposes a Normal-Depth diffusion model for 3D generation using LAION dataset and introduces an albedo diffusion model to handle mixed illumination effects, achieving state-of-the-art results.


UGG: Unified Generative Grasping

Jiaxin Lu,Hao Kang,Haoxiang Li,Bo Liu,Yiding Yang,Qixing Huang,Gang Hua

http://arxiv.org/abs/2311.16917v1

Compressor summary: The UGG model generates diverse and successful grasping postures for objects by using a diffusion-based approach that incorporates object, hand, and contact information.


Brain-ID: Learning Robust Feature Representations for Brain Imaging

Peirong Liu,Oula Puonti,Xiaoling Hu,Daniel C. Alexander,Juan Eugenio Iglesias

http://arxiv.org/abs/2311.16914v1

Compressor summary: Brain-ID is a robust feature representation learning strategy for brain imaging that works well on various tasks and datasets, even with limited training data.


Lane-Keeping Control of Autonomous Vehicles Through a Soft-Constrained Iterative LQR

Der-Hau Lee

http://arxiv.org/abs/2311.16900v1

Compressor summary: The soft-CILQR algorithm improves the stability and smoothness of autonomous vehicle steering by using slack variables to soften constraints in the optimization process.


Dendrogram distance: an evaluation metric for generative networks using hierarchical clustering

Gustavo Sutter Carvalho,Moacir Antonelli Ponti

http://arxiv.org/abs/2311.16894v1

Compressor summary: The paper proposes a new way to measure how well generative models capture all aspects of the data using dendrograms and shows it performs as well as existing methods.
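A minimal sketch of comparing two sample sets via their dendrograms, assuming SciPy's hierarchical-clustering utilities; this is an illustrative proxy based on cophenetic distances, not necessarily the paper's exact metric:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def dendrogram_distance(A, B):
    # Build a dendrogram for each sample set and compare their
    # cophenetic distances (the merge heights at which point pairs join).
    ca = cophenet(linkage(pdist(A), method="average"))
    cb = cophenet(linkage(pdist(B), method="average"))
    return float(np.abs(ca - cb).mean())

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 2))
```

A set compared against itself scores 0, while a rescaled copy (whose merge heights all change) scores above 0.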


Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning

Daniel Barley,Holger Fröning

http://arxiv.org/abs/2311.16883v1

Compressor summary: This paper proposes a method to reduce memory usage in DNNs during training by pruning activations using structured pruning and block sparsity, achieving up to 32% memory reduction without sacrificing accuracy on image classification tasks.
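The core operation, keeping only the highest-magnitude activation blocks before storing them for the backward pass, can be sketched like this (block size, scoring rule, and keep ratio are illustrative choices, not the paper's):

```python
import numpy as np

def block_prune(act, block=4, keep=0.5):
    # Score each contiguous block by its L1 magnitude, keep the top
    # `keep` fraction of blocks, and zero the rest; the block-sparse
    # result is what would be saved for the backward pass.
    blocks = act.reshape(-1, block)
    norms = np.abs(blocks).sum(axis=1)
    k = max(1, int(keep * len(norms)))
    mask = np.zeros(len(norms), dtype=bool)
    mask[np.argsort(norms)[-k:]] = True
    return (blocks * mask[:, None]).reshape(act.shape)

act = np.array([0.1, -0.2, 0.1, 0.0, 5.0, -4.0, 3.0, 2.0])
sparse = block_prune(act)
```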


Optimisation-Based Multi-Modal Semantic Image Editing

Bowen Li,Yongxin Yang,Steven McDonagh,Shifeng Zhang,Petru-Daniel Tudosiu,Sarah Parisot

http://arxiv.org/abs/2311.16882v1

Compressor summary: The paper introduces an image editing method that allows various types of instructions and balances local modifications with global consistency using two loss functions.


Imputation using training labels and classification via label imputation

Thu Nguyen,Pål Halvorsen,Michael A. Riegler

http://arxiv.org/abs/2311.16877v1

Compressor summary: The authors propose a method that stacks the label with the input for imputation, improving the performance of classification models with missing data.
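The label-stacking idea can be demonstrated with a hand-rolled nearest-neighbour imputer; the imputer itself is illustrative, and the point is appending `y` as an extra column so imputation can exploit label information:

```python
import numpy as np

def knn_impute(M, k=1):
    # Fill NaNs in each row from the k nearest fully observed rows,
    # measuring distance over the columns the row does have.
    M = M.copy()
    complete = M[~np.isnan(M).any(axis=1)]
    for i in range(len(M)):
        miss = np.isnan(M[i])
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - M[i, ~miss], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        M[i, miss] = nearest[:, miss].mean(axis=0)
    return M

# Stack the training label as an extra column, impute, then drop it:
# the missing feature is now filled in by a same-class neighbour.
X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 6.0], [5.2, 6.1]])
y = np.array([0.0, 0.0, 1.0, 1.0])
X_imputed = knn_impute(np.column_stack([X, y]))[:, :-1]
```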


A unified weighting framework for evaluating nearest neighbour classification

Oliver Urs Lenz,Henri Bollaert,Chris Cornelis

http://arxiv.org/abs/2311.16872v1

Compressor summary: The paper evaluates different methods for nearest neighbor classification using fuzzy logic and kernel functions, finding that some perform best with Boscovich distance and others with Yager negation.
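As a concrete instance of kernel-weighted nearest-neighbour voting (here with a triangular kernel; the paper's framework covers many such weighting schemes):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3):
    # Distance-weighted vote with a triangular kernel w = 1 - d/d_max,
    # so closer neighbours contribute more to the class score.
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 - d[idx] / (d[idx].max() + 1e-12)
    scores = {}
    for label, weight in zip(y_train[idx], w):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
```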


The Falcon Series of Open Language Models

Ebtesam Almazrouei,Hamza Alobeidli,Abdulaziz Alshamsi,Alessandro Cappelli,Ruxandra Cojocaru,Daniel Hesslow,Julien Launay,Quentin Malartic,Daniele Mazzotta,Badreddine Noune,Baptiste Pannier,Guilherme Penedo

http://arxiv.org/abs/2311.16867v1

Compressor summary: The Falcon series is a family of causal decoder-only models of different sizes, pretrained on a large web dataset, that achieve high performance while being cost-effective.


A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography

Noëmi Aepli,Chantal Amrhein,Florian Schottmann,Rico Sennrich

http://arxiv.org/abs/2311.16865v1

Compressor summary: The paper evaluates how well evaluation metrics work for Swiss German dialects and proposes improvements to make them more robust.


Power Hungry Processing: Watts Driving the Cost of AI Deployment?

Alexandra Sasha Luccioni,Yacine Jernite,Emma Strubell

http://arxiv.org/abs/2311.16863v1

Compressor summary: This paper compares the environmental cost of different types of machine learning systems, finding that multi-purpose generative models are much more energy-intensive than task-specific ones.


Data-efficient operator learning for solving high Mach number fluid flow problems

Noah Ford,Victor J. Leon,Honest Merman,Jeffrey Gilbert,Alexander New

http://arxiv.org/abs/2311.16860v1

Compressor summary: The paragraph discusses using SciML with Neural Basis Functions to improve predictions for high Mach fluid flows over irregular geometries when data is limited.


Attentional Graph Neural Networks for Robust Massive Network Localization

Wenzhong Yan,Juntao Wang,Feng Yin,Abdelhak M. Zoubir

http://arxiv.org/abs/2311.16856v1

Compressor summary: The paper proposes a novel GNN-based method for network localization that is stable, accurate, and robust to NLOS propagations, and introduces two attentional graph neural networks (AGNNs) that improve accuracy and flexibility by learning optimal hyperparameters.


A Unified Approach for Text- and Image-guided 4D Scene Generation

Yufeng Zheng,Xueting Li,Koki Nagano,Sifei Liu,Otmar Hilliges,Shalini De Mello

http://arxiv.org/abs/2311.16854v1

Compressor summary: The authors propose Dream-in-4D, a novel two-stage method that leverages diffusion guidance to generate high-quality static and dynamic 3D scenes from text prompts, achieving significant improvements in quality and controllability compared to existing approaches.


Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

Chen Zhao,Weiling Cai,Chenyu Dong,Chengwei Hu

http://arxiv.org/abs/2311.16845v1

Compressor summary: The paper introduces WF-Diff, a novel framework for enhancing underwater images using frequency domain information and diffusion models, which improves the visual quality of underwater images.


Self-training solutions for the ICCV 2023 GeoNet Challenge

Lijun Sheng,Zhengbo Wang,Jian Liang

http://arxiv.org/abs/2311.16843v1

Compressor summary: The paper introduces a new domain adaptation benchmark called GeoNet with three challenges and presents a two-stage source-free method using Swin Transformer to achieve high performance in all challenges.


The Claire French Dialogue Dataset

Julie Hunter,Jérôme Louradour,Virgile Rennard,Ismaïl Harrando,Guokan Shang,Jean-Pierre Lorré

http://arxiv.org/abs/2311.16840v1

Compressor summary: The Claire French Dialogue Dataset (CFDD) is a large corpus of diverse French texts released for developing multilingual language models, and this paper describes its composition, categories, and format.


Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Zhiyuan Zhao,Bin Wang,Linke Ouyang,Xiaoyi Dong,Jiaqi Wang,Conghui He

http://arxiv.org/abs/2311.16839v1

Compressor summary: The paper proposes HA-DPO, a novel approach to address the hallucination problem in multimodal language models by training them to prefer accurate responses over hallucinating ones, leading to improved accuracy and generalization.


Unified-modal Salient Object Detection via Adaptive Prompt Learning

Kunpeng Wang,Chenglong Li,Zhengzheng Tu,Bin Luo

http://arxiv.org/abs/2311.16835v1

Compressor summary: UniSOD is a framework that combines single-modal and multi-modal salient object detection using modality-aware prompts with task-specific hints, achieving consistent performance improvement on various datasets.


Modular Neural Networks for Time Series Forecasting: Interpretability and Feature Selection using Attention

Qiqi Su,Christos Kloukinas,Artur d'Avila Garcez

http://arxiv.org/abs/2311.16834v1

Compressor summary: The paper presents an interpretable modular neural network model for multivariate time series prediction that combines a recurrent neural network with an attention-based feature selection component and achieves high performance comparable to non-interpretable methods.


1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness

Bernd Prach,Fabio Brau,Giorgio Buttazzo,Christoph H. Lampert

http://arxiv.org/abs/2311.16833v1

Compressor summary: This paper compares different methods for creating 1-Lipschitz neural networks, which are more robust against input perturbations, and provides guidelines to choose the best method depending on available resources.
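One standard way to build a 1-Lipschitz linear layer, spectral normalization via power iteration, can be sketched as follows (it is one of the methods such comparisons typically cover, not necessarily the paper's recommendation):

```python
import numpy as np

def spectral_normalize(W, n_iter=500):
    # Estimate the largest singular value of W by power iteration and
    # divide it out, making x -> W @ x a 1-Lipschitz map.
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    v = None
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # Rayleigh-quotient estimate of sigma_max
    return W / sigma

W = np.random.default_rng(1).standard_normal((4, 4))
W_norm = spectral_normalize(W)
```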


CharacterGLM: Customizing Chinese Conversational AI Characters with Large Language Models

Jinfeng Zhou,Zhuang Chen,Dazhen Wan,Bosi Wen,Yi Song,Jifan Yu,Yongkang Huang,Libiao Peng,Jiaming Yang,Xiyao Xiao,Sahand Sabour,Xiaohan Zhang,Wenjing Hou,Yijia Zhang,Yuxiao Dong,Jie Tang,Minlie Huang

http://arxiv.org/abs/2311.16832v1

Compressor summary: CharacterGLM is a series of large language models that can generate character-based dialogues for conversational AI systems, with better performance than most existing models in terms of consistency, human-likeness, and engagement.


Decomposer: Semi-supervised Learning of Image Restoration and Image Decomposition

Boris Meinardus,Mariusz Trzeciakiewicz,Tim Herzig,Monika Kwiatkowski,Simon Matern,Olaf Hellwich

http://arxiv.org/abs/2311.16829v1

Compressor summary: Decomposer is a semi-supervised model that uses 3D Swin-Transformers and 3D U-Nets to separate distorted images into their original components and the applied augmentations like shadows, lighting, and occlusions.


SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization

Xiaojing Zhong,Xinyi Huang,Zhonghua Wu,Guosheng Lin,Qingyao Wu

http://arxiv.org/abs/2311.16828v1

Compressor summary: SARA is a novel method for makeup transfer that can handle large spatial misalignments, preserve the source images' identities, and achieve part-specific and shade-controllable results using three modules: spatial alignment, region-adaptive normalization, and makeup fusion.


Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

Martin Briesch,Dominik Sobania,Franz Rothlauf

http://arxiv.org/abs/2311.16822v1

Compressor summary: The study examines how large language models generate and consume content in a self-consuming loop, finding that this process improves quality and diversity at first but then decreases diversity over time.


DI-Net : Decomposed Implicit Garment Transfer Network for Digital Clothed 3D Human

Xiaojing Zhong,Yukun Su,Zhonghua Wu,Guosheng Lin,Qingyao Wu

http://arxiv.org/abs/2311.16818v1

Compressor summary: DI-Net is a new method for 3D virtual try-on that uses two modules to reconstruct a human mesh with accurate pose and texture preservation.


Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

Yuqing Wen,Yucheng Zhao,Yingfei Liu,Fan Jia,Yanhui Wang,Chong Luo,Chi Zhang,Tiancai Wang,Xiaoyan Sun,Xiangyu Zhang

http://arxiv.org/abs/2311.16813v1

Compressor summary: Panacea is a method to create diverse, annotated videos for autonomous driving research that ensures consistency and controllability in driving scenarios.


Agent-Aware Training for Agent-Agnostic Action Advising in Deep Reinforcement Learning

Yaoquan Wei,Shunyu Liu,Jie Song,Tongya Zheng,Kaixuan Chen,Yong Wang,Mingli Song

http://arxiv.org/abs/2311.16807v1

Compressor summary: The proposed A7 framework uses state feature similarity, proxy models, and behavior cloning to efficiently advise agents in DRL without relying on specific agents or expert teachers.


A Survey of the Evolution of Language Model-Based Dialogue Systems

Hongru Wang,Lingzhi Wang,Yiming Du,Liang Chen,Jingyan Zhou,Yufei Wang,Kam-Fai Wong

http://arxiv.org/abs/2311.16789v1

Compressor summary: The survey describes the four stages of dialogue system evolution, highlighting their dependence on language model advancements and discussing current challenges and future directions for LLM-based systems.


Evaluating Optimal Reference Translations

Vilém Zouhar,Věra Kloudová,Martin Popel,Ondřej Bojar

http://arxiv.org/abs/2311.16787v1

Compressor summary: The article introduces a method to create more reliable human reference translations for machine translation evaluation by raising the bar of human translation quality.


The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation

Christel Chappuis,Eliot Walt,Vincent Mendez,Sylvain Lobry,Bertrand Le Saux,Devis Tuia

http://arxiv.org/abs/2311.16782v1

Compressor summary: The text discusses language biases in remote sensing visual question answering (RSVQA), their impact on model performance and robustness, and the need for less-biased datasets and more informative evaluation metrics.


Multi-Channel Cross Modal Detection of Synthetic Face Images

M. Ibsen,C. Rathgeb,S. Marcel,C. Busch

http://arxiv.org/abs/2311.16773v1

Compressor summary: The authors propose a new method for detecting synthetic face images using Cross Modal Focal Loss, which performs better than existing methods in cross-model experiments.


Rescuing referral failures during automated diagnosis of domain-shifted medical images

Anuj Srivastava,Karm Patel,Pradeep Shenoy,Devarajan Sridharan

http://arxiv.org/abs/2311.16766v1

Compressor summary: The paper discusses the problem of selective classification during automated diagnosis with domain-shifted medical images and how current approaches fail to handle uncertainty in such cases, leading to poor performance and a need for human intervention.


Radiology-Aware Model-Based Evaluation Metric for Report Generation

Amos Calamida,Farhad Nooralahzadeh,Morteza Rohanian,Koji Fujimoto,Mizuho Nishio,Michael Krauthammer

http://arxiv.org/abs/2311.16764v1

Compressor summary: The authors propose a new evaluation metric for machine-generated radiology reports based on the COMET architecture, train models, and show that their metric correlates well with existing metrics and human judgment.


Towards Full-scene Domain Generalization in Multi-agent Collaborative Bird's Eye View Segmentation for Connected and Autonomous Driving

Senkang Hu,Zhengru Fang,Xianhao Chen,Yuguang Fang,Sam Kwong

http://arxiv.org/abs/2311.16754v1

Compressor summary: The paragraph discusses a framework for improving collaborative perception in autonomous vehicles by addressing domain shifts and data heterogeneity using Amplitude Augmentation and meta-consistency training, as well as an intra-system domain alignment mechanism during inference.


As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors

Seungwoo Yoo,Kunho Kim,Vladimir G. Kim,Minhyuk Sung

http://arxiv.org/abs/2311.16739v1

Compressor summary: APAP is a mesh deformation technique that uses 2D diffusion models and user input to create realistic and plausible edits of 2D and 3D meshes.


Riemannian Self-Attention Mechanism for SPD Networks

Rui Wang,Xiao-Jun Wu,Hui Li,Josef Kittler

http://arxiv.org/abs/2311.16738v1

Compressor summary: The paper proposes an SPD manifold self-attention mechanism (SMSA) for learning features in scientific areas, which uses geometric operations such as the Riemannian metric and Riemannian optimization to capture dependencies of data on a curved Riemannian manifold.


Point'n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields

Jiajun Huang,Hongchuan Yu

http://arxiv.org/abs/2311.16737v1

Compressor summary: Point'n Move is an interactive scene manipulation method that uses Gaussian Splatting Radiance Field for real-time editing and inpainting of selected objects in scenes.


Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras

Huajian Huang,Longwei Li,Hui Cheng,Sai-Kit Yeung

http://arxiv.org/abs/2311.16728v1

Compressor summary: Photo-SLAM is a novel SLAM framework that uses explicit geometric features and implicit photometric features for efficient online photorealistic mapping on portable devices.


Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

Yijun Yang,Tianyi Zhou,Kanxue Li,Dapeng Tao,Lusong Li,Li Shen,Xiaodong He,Jing Jiang,Yuhui Shi

http://arxiv.org/abs/2311.16714v1

Compressor summary: The paper presents EMMA, an embodied multi-modal agent that adapts quickly to a visual world by distilling knowledge from a large language model in a parallel text world.


LEDITS++: Limitless Image Editing using Text-to-Image Models

Manuel Brack,Felix Friedrich,Katharina Kornmeier,Linoy Tsaban,Patrick Schramowski,Kristian Kersting,Apolinário Passos

http://arxiv.org/abs/2311.16711v1

Compressor summary: LEDITS++ is a text-based image editing technique that is efficient, versatile, and precise, supporting multiple edits without fine-tuning or optimization.


Sinkhorn Flow: A Continuous-Time Framework for Understanding and Generalizing the Sinkhorn Algorithm

Mohammad Reza Karimi,Ya-Ping Hsieh,Andreas Krause

http://arxiv.org/abs/2311.16706v1

Compressor summary: The text discusses entropy-regularized optimal transport problems in machine learning, introducing a continuous-time version of the Sinkhorn algorithm that can handle noise and bias and connects to other related dynamics.
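The paper's continuous-time framework generalizes the classic discrete Sinkhorn iterations. As background only (this is the textbook algorithm, not the paper's continuous-time method), a minimal NumPy sketch:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iters=500):
    """Entropy-regularized optimal transport via classic Sinkhorn iterations.

    C: (n, m) cost matrix; a: (n,) and b: (m,) marginals summing to 1.
    Returns a transport plan P whose row/column sums match a and b.
    """
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale to match column marginals
        u = a / (K @ v)               # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy example: transport between two small discrete point clouds
rng = np.random.default_rng(0)
x, y = rng.random((4, 2)), rng.random((5, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared distances
a = np.full(4, 1 / 4)
b = np.full(5, 1 / 5)
P = sinkhorn(C, a, b)
```

The alternating rescaling of `u` and `v` is exactly the discrete-time dynamics whose continuous-time limit the paper studies.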


CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

Haocheng Yuan,Jing Xu,Hao Pan,Adrien Bousseau,Niloy Mitra,Changjian Li

http://arxiv.org/abs/2311.16703v1

Compressor summary: The authors propose a method to semantically comment on CAD programs by parsing them, generating shapes and images, and using visual-semantic analysis to assign labels to code blocks.


Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation

Vandan Gorade,Sparsh Mittal,Debesh Jha,Ulas Bagci

http://arxiv.org/abs/2311.16700v1

Compressor summary: HLFD is a novel method that improves knowledge distillation for medical imaging tasks by strategically transferring knowledge from middle to earlier layers and vice versa, leading to better focus on tumor-specific features and improved performance.


Hyper-Relational Knowledge Graph Neural Network for Next POI

Jixiao Zhang,Yongkang Li,Ruotong Zou,Jingyuan Zhang,Zipei Fan,Xuan Song

http://arxiv.org/abs/2311.16683v1

Compressor summary: The paper proposes a new model, HKGNN, for POI recommendation in LBSN that considers hyper-relations, structural information, and side information to address data sparsity and improve recommendations.


ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

Jiawei Wang,Changjian Li

http://arxiv.org/abs/2311.16682v1

Compressor summary: ContextSeg is a novel method that uses an autoencoder with dense distance fields and a Transformer with group-based labeling to achieve state-of-the-art semantic segmentation of strokes in computer vision.


Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations

Maximilian Dreyer,Reduan Achtibat,Wojciech Samek,Sebastian Lapuschkin

http://arxiv.org/abs/2311.16681v1

Compressor summary: The authors propose a novel XAI framework that uses prototypes to convey both local and global decision-making strategies of DNNs, enabling better understanding, model validation, and detection of outlier behavior.


Entity-Aspect-Opinion-Sentiment Quadruple Extraction for Fine-grained Sentiment Analysis

Dan Ma,Jun Xu,Zongyu Wang,Xuezhi Cao,Yunsen Xian

http://arxiv.org/abs/2311.16678v1

Compressor summary: The paper introduces a new task (EASQE) for aspect-based sentiment analysis that decomposes aspect terms into entities and aspects, and proposes a baseline method (Trigger-Opinion) that outperforms existing approaches.


A Distribution-Based Threshold for Determining Sentence Similarity

Gioele Cadamuro,Marco Gruppo

http://arxiv.org/abs/2311.16675v1

Compressor summary: The authors propose a siamese neural network for a semantic textual similarity task involving highly specific information; a distribution-based threshold separates similar from dissimilar pairs, and features from both the distributions and distance functions are combined to score predictions.


Large Language Models Meet Computer Vision: A Brief Survey

Raby Hamadi

http://arxiv.org/abs/2311.16673v1

Compressor summary: The paper surveys the latest advancements in transformers and their impact on computer vision and large language models, comparing different models and datasets, and suggesting future directions for research.


SplitNeRF: Split Sum Approximation Neural Field for Joint Geometry, Illumination, and Material Estimation

Jesus Zarzar,Bernard Ghanem

http://arxiv.org/abs/2311.16671v1

Compressor summary: SplitNeRF digitizes real-world objects by estimating their geometry, material properties, and lighting from posed images with fixed lighting, using Neural Radiance Fields and image-based lighting.


PyTorch Geometric High Order: A Unified Library for High Order Graph Neural Network

Xiyuan Wang,Muhan Zhang

http://arxiv.org/abs/2311.16670v1

Compressor summary: PyTorch Geometric High Order (PyGHO) is a library that simplifies the implementation of high-order graph neural networks and improves their performance on real-world tasks.


LiveNVS: Neural View Synthesis on Live RGB-D Streams

Laura Fink,Darius Rückert,Linus Franke,Joachim Keinert,Marc Stamminger

http://arxiv.org/abs/2311.16668v1

Compressor summary: LiveNVS is a system that enables real-time, high-quality neural novel view synthesis for live RGB-D input using projected neural features and a generalizable neural network.


MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures

Zhuoyuan Wang,Jiacong Mi,Shan Lu,Jieyue He

http://arxiv.org/abs/2311.16666v1

Compressor summary: MolIG is a novel framework that uses both image and graph structures to predict drug molecule properties, outperforming existing models.


DGNR: Density-Guided Neural Point Rendering of Large Driving Scenes

Zhuopeng Li,Chenming Wu,Liangjun Zhang,Jianke Zhu

http://arxiv.org/abs/2311.16664v1

Compressor summary: The paper proposes a novel method called Density-Guided Neural Rendering (DGNR) that learns a density space from scenes to guide the construction of a point-based renderer for large-scale driving scenes, eliminating the need for geometric priors and achieving photorealistic and efficient rendering.


SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction

Yu Chen,Gim Hee Lee

http://arxiv.org/abs/2311.16657v1

Compressor summary: SCALAR-NeRF is a framework that uses an encoder-decoder architecture and a coarse-to-fine strategy to reconstruct large-scale scenes efficiently and effectively, outperforming existing NeRF methods.


Pseudo-Likelihood Inference

Theo Gruner,Boris Belousov,Fabio Muratore,Daniel Palenicek,Jan Peters

http://arxiv.org/abs/2311.16656v1

Compressor summary: Pseudo-Likelihood Inference (PLI) is a new method that improves Simulation-Based Inference by combining neural approximation with likelihood kernel, making it better at handling challenging Bayesian system identification tasks on higher dimensions and dynamic systems.


Elucidating Discrepancy in Explanations of Predictive Models Developed using EMR

Aida Brankovic,Wenjie Huang,David Cook,Sankalp Khanna,Konstanty Bialkowski

http://arxiv.org/abs/2311.16654v1

Compressor summary: The study evaluates how well explainable AI methods align with expert medical knowledge in clinical decision support algorithms for EMR data, identifying discrepancies and suggesting ways to improve trustworthiness.


Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning

Zhantao Chen,Cong Wang,Mingye Gao,Chun Hong Yoon,Jana B. Thayer,Joshua J. Turner

http://arxiv.org/abs/2311.16652v1

Compressor summary: The authors present a machine learning approach that improves Single Particle Imaging (SPI) with X-ray Free Electron Lasers (XFELs) by estimating particle orientations and reciprocal space intensities from diffraction images only.


Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification

Jiahuan Yan,Haojun Gao,Zhang Kai,Weize Liu,Danny Chen,Jian Wu,Jintai Chen

http://arxiv.org/abs/2311.16650v1

Compressor summary: The paper proposes Text2Tree, a novel algorithm that uses internal label hierarchy in training deep learning models for medical text classification, improving performance on imbalanced and scarce data.


Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective

Ming-Yu Chung,Sheng-Yen Chou,Chia-Mu Yu,Pin-Yu Chen,Sy-Yen Kuo,Tsung-Yi Ho

http://arxiv.org/abs/2311.16646v1

Compressor summary: The study presents new trigger pattern generation methods for dataset distillation, which enable effective and hard-to-detect backdoor attacks.


Scaling Political Texts with ChatGPT

Gaël Le Mens,Aina Gallego

http://arxiv.org/abs/2311.16639v1

Compressor summary: The text describes how GPT-4 can estimate the positions of political texts on various dimensions accurately, quickly, and cheaply, comparing its performance with other methods.


Parallax-Tolerant Image Stitching with Epipolar Displacement Field

Jian Yu,Yi Yu,Feipeng Da

http://arxiv.org/abs/2311.16637v1

Compressor summary: The paper presents a new method for stitching large parallax images using epipolar geometry, which reduces alignment artifacts and maintains projectivity.


MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation

Sitong Su,Litao Guo,Lianli Gao,Hengtao Shen,Jingkuan Song

http://arxiv.org/abs/2311.16635v1

Compressor summary: MotionZero is a zero-shot text-to-video generation method that, without any motion supervision, exploits the motion priors implied by the prompts to accurately and independently control the motion of different objects.


Outfit Completion via Conditional Set Transformation

Takuma Nakamura,Yuki Saito,Ryosuke Goto

http://arxiv.org/abs/2311.16630v1

Compressor summary: The paper introduces a new framework for outfit completion using deep neural networks and a conditional set transformation architecture that improves accuracy and scalability.


Gaussian Processes for Monitoring Air-Quality in Kampala

Clara Stoddart,Lauren Shrack,Richard Sserunjogi,Usman Abdul-Ganiy,Engineer Bainomugisha,Deo Okure,Ruth Misener,Jose Pablo Folch,Ruby Sedgwick

http://arxiv.org/abs/2311.16625v1

Compressor summary: The paper explores using Gaussian Processes to nowcast and forecast air pollution in Kampala, Uganda, where sensor coverage is limited.
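The nowcasting setup described above is standard GP regression over sensor locations. A minimal scikit-learn sketch with made-up toy coordinates and readings (not the paper's actual Kampala data or kernel choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical toy data: PM2.5-like readings at a handful of sensor sites
rng = np.random.default_rng(42)
sensor_xy = rng.uniform(0, 10, size=(20, 2))            # sensor coordinates
pm25 = 50 + 10 * np.sin(sensor_xy[:, 0]) + rng.normal(0, 2, 20)

# RBF kernel for spatial smoothness plus white noise for sensor error
gp = GaussianProcessRegressor(
    kernel=1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0),
    normalize_y=True,
)
gp.fit(sensor_xy, pm25)

# Predict pollution (with uncertainty) on a grid covering unmonitored areas
grid = np.stack(np.meshgrid(np.linspace(0, 10, 5),
                            np.linspace(0, 10, 5)), -1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)
```

The predictive standard deviation is what makes GPs attractive when sensor coverage is sparse: it flags where the nowcast is least trustworthy.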


On the Long Range Abilities of Transformers

Itamar Zimerman,Lior Wolf

http://arxiv.org/abs/2311.16620v1

Compressor summary: The authors propose modifications to the transformer architecture inspired by long-range layers, improving its performance on the Long Range Arena benchmark while maintaining simplicity and minimal additional computation.


Cross-level Attention with Overlapped Windows for Camouflaged Object Detection

Jiepan Li,Fangxiao Lu,Nan Xue,Zhuohong Li,Hongyan Zhang,Wei He

http://arxiv.org/abs/2311.16618v1

Compressor summary: The paper introduces a new method called OWinCA that enhances low-level features for detecting camouflaged objects using cross-level attention and an overlapped window partition strategy, achieving better results than existing methods.


Adversarial Distribution Balancing for Counterfactual Reasoning

Stefan Schrod,Fabian Sinz,Michael Altenbuchinger

http://arxiv.org/abs/2311.16616v1

Compressor summary: ADBCR is a machine learning method for counterfactual reasoning that uses potential outcome estimates to remove spurious causal relations and performs well on benchmark datasets, especially when using unlabeled validation data.


LasTGL: An Industrial Framework for Large-Scale Temporal Graph Learning

Jintang Li,Jiawang Dan,Ruofan Wu,Jing Zhou,Sheng Tian,Yunfei Liu,Baokun Wang,Changhua Meng,Weiqiang Wang,Yuchang Zhu,Liang Chen,Zibin Zheng

http://arxiv.org/abs/2311.16605v1

Compressor summary: LasTGL is an industrial framework that integrates implementations of common temporal graph learning algorithms to facilitate research and application in this emerging field.


Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity

Daeun Lee,Minhyeok Heo,Jiwon Kim

http://arxiv.org/abs/2311.16589v1

Compressor summary: The paper proposes a new framework for improving lane detection algorithms by using HD maps and generative models to increase data diversity without expanding the data volume.


MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing

Rui Yang,Qingcheng Zeng,Keen You,Yujie Qiao,Lucas Huang,Chia-Chun Hsieh,Benjamin Rosand,Jeremy Goldwasser,Amisha D Dave,Tiarnan D. L. Keenan,Emily Y Chew,Dragomir Radev,Zhiyong Lu,Hua Xu,Qingyu Chen,Irene Li

http://arxiv.org/abs/2311.16588v1

Compressor summary: The MedGen NLP toolkit offers easy-to-use generative and basic NLP functions for biomedical researchers and healthcare professionals, with fine-tuned domain models and public availability.


Clean Label Disentangling for Medical Image Segmentation with Noisy Labels

Zicheng Wang,Zhen Zhao,Erjian Guo,Luping Zhou

http://arxiv.org/abs/2311.16580v1

Compressor summary: The authors propose a class-balanced sampling strategy and a noisy feature-aided clean label disentangling framework to address the noisy label issue in medical image segmentation, achieving state-of-the-art performance.


Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions

Xinhong Chen,Zongxi Li,Yaowei Wang,Haoran Xie,Jianping Wang,Qing Li

http://arxiv.org/abs/2311.16579v1

Compressor summary: The paper introduces a new task that identifies valid causal relationships between emotions and causes in texts, taking into account specific context clauses, and proposes a multi-task framework to handle this task.


Efficient Key-Based Adversarial Defense for ImageNet by Using Pre-trained Model

AprilPyone MaungMaung,Isao Echizen,Hitoshi Kiya

http://arxiv.org/abs/2311.16577v1

Compressor summary: The paper introduces key-based defense model proliferation using pre-trained models and efficient fine-tuning techniques for on-device image classification, improving accuracy by more than 10%.


MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

Yang Zhao,Yanwu Xu,Zhisheng Xiao,Tingbo Hou

http://arxiv.org/abs/2311.16567v1

Compressor summary: The paper introduces MobileDiffusion, an efficient text-to-image diffusion model with reduced size and fast inference speed, achieved through architecture optimization and sampling techniques.


DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

Peng Chen,Xiaobao Wei,Ming Lu,Yitong Zhu,Naiming Yao,Xingyu Xiao,Hui Chen

http://arxiv.org/abs/2311.16565v1

Compressor summary: The proposed DiffusionTalker method uses contrastive learning and knowledge distillation to personalize and speed up 3D facial animation based on speech input, overcoming limitations of existing diffusion-based approaches.


Scalable Label Distribution Learning for Multi-Label Classification

Xingyu Zhao,Yuexuan An,Lei Qi,Xin Geng

http://arxiv.org/abs/2311.16556v1

Compressor summary: SLDL is a novel multi-label classification method that uses continuous distributions in a low-dimensional latent space to model asymmetric label correlations, reducing computational complexity and achieving competitive performance.


Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

Ling Fu,Zijie Wu,Yingying Zhu,Yuliang Liu,Xiang Bai

http://arxiv.org/abs/2311.16555v1

Compressor summary: DiffText is a new method that uses a diffusion model to create realistic synthetic text images with less spelling errors and better background integration, improving scene text detection performance.


HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors

Shutong Zhang,Yi-Ling Qiao,Guanglei Zhu,Eric Heiden,Dylan Turpin,Jingzhou Liu,Ming Lin,Miles Macklin,Animesh Garg

http://arxiv.org/abs/2311.16552v1

Compressor summary: HandyPriors is a unified and general pipeline for pose estimation in human-object interaction scenes using differentiable physics and rendering priors, with two alternatives for hand and object pose estimation that achieve comparable or superior results and can be used for robotic manipulation and perception tasks.


Multi-Irreducible Spectral Synchronization for Robust Rotation Averaging

Owen Howell,Haoen Huang,David Rosen

http://arxiv.org/abs/2311.16544v1

Compressor summary: The authors propose a convex spectral relaxation method for estimating unknown orientations in robotics and computer vision, which has advantages over prior methods and provides performance guarantees under specific noise assumptions.


Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance

Siyu Xing,Jie Cao,Huaibo Huang,Xiao-Yu Zhang,Ran He

http://arxiv.org/abs/2311.16507v1

Compressor summary: StraightFM is a novel flow matching method that straightens trajectories using diffusion models and real data, resulting in higher quality images with fewer sampling steps.
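Setting StraightFM's diffusion-guidance specifics aside, the generic flow-matching recipe with straight interpolation paths can be sketched as follows (toy Gaussian data and a crude hand-picked velocity model, purely for illustration):

```python
import numpy as np

def straight_path_targets(x0, x1, t):
    """Linear interpolation path used in flow matching with straight paths.

    x0: source (noise) samples, x1: target (data) samples, t: times in [0, 1].
    Returns the point on the path and the (constant) target velocity.
    """
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1
    ut = x1 - x0                        # straight paths => constant velocity
    return xt, ut

rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 2))               # noise samples
x1 = rng.normal(loc=3.0, size=(256, 2))      # "data" samples
t = rng.uniform(size=256)
xt, ut = straight_path_targets(x0, x1, t)

# Flow-matching loss: regress a velocity model v(x, t) onto the target ut.
# Here a trivial constant model stands in for the neural network.
pred = np.broadcast_to(x1.mean(0) - x0.mean(0), ut.shape)
loss = np.mean((pred - ut) ** 2)
```

Straighter trajectories mean the learned velocity field changes little along each path, which is why such models can sample with few integration steps.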


Agents meet OKR: An Object and Key Results Driven Agent System with Hierarchical Self-Collaboration and Self-Evaluation

Yi Zheng,Chongyang Ma,Kanle Shi,Haibin Huang

http://arxiv.org/abs/2311.16542v1

Compressor summary: The OKR-Agent framework enhances Large Language Models' task-solving abilities by using self-collaboration, self-correction, and hierarchical agents to improve domain knowledge, reasoning, and execution structure.


Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans

Ray Zirui Zhang,Ivan Ezhov,Michal Balcerak,Andy Zhu,Benedikt Wiestler,Bjoern Menze,John Lowengrub

http://arxiv.org/abs/2311.16536v1

Compressor summary: This paper proposes a method that uses Physics-Informed Neural Networks to estimate patient-specific parameters of a model of Glioblastoma growth from a single MRI scan, which could help in designing personalized radiotherapy treatment plans.


Graph Prompt Learning: A Comprehensive Survey and Beyond

Xiangguo Sun,Jiawen Zhang,Xixi Wu,Hong Cheng,Yun Xiong,Jia Li

http://arxiv.org/abs/2311.16534v1

Compressor summary: The paper surveys the emerging domain of graph prompts in AGI, proposes a unified framework for understanding graph prompt learning that categorizes over 100 works in the field, presents ProG (a Python library and website) to support research in graph prompting, and discusses current challenges and future directions as a roadmap for the area.


On robust overfitting: adversarial training induced distribution matters

Runzhi Tian,Yongyi Mao

http://arxiv.org/abs/2311.16526v1

Compressor summary: Adversarial training induces robust overfitting; this paper investigates its correlation with perturbation-induced distributions and proposes a new upper bound on generalization error based on local dispersion.


3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions

Sihwa Park,Seongjun Kim,In-Seok Song,Seung Jun Baek

http://arxiv.org/abs/2311.16524v1

Compressor summary: Occudent is a novel framework that uses neural implicit functions to reconstruct 3D teeth shapes from panoramic radiographs, outperforming existing methods.


Evaluation of dynamic characteristics of power grid based on GNN and application on knowledge graph

Hao Pei,Si Lin,Chuanfu Li,Che Wang,Haoming Chen,Sizhe Li

http://arxiv.org/abs/2311.16522v1

Compressor summary: The text describes a new method using a graph neural network that can accurately detect and analyze faults in power grids with high accuracy and insightful results.


B-LSTM-MIONet: Bayesian LSTM-based Neural Operators for Learning the Response of Complex Dynamical Systems to Length-Variant Multiple Input Functions

Zhihao Kong,Amirhossein Mollaali,Christian Moya,Na Lu,Guang Lin

http://arxiv.org/abs/2311.16519v1

Compressor summary: B-LSTM-MIONet is a redesigned framework that combines MIONet, LSTM, and Bayesian methods to learn neural operators from time-dependent data, handling variable-length real-time data and providing uncertainty quantification for complex systems modeling.


StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

Kazuki Yamauchi,Yusuke Ijima,Yuki Saito

http://arxiv.org/abs/2311.16509v1

Compressor summary: StyleCap is a method to generate natural language descriptions of speaking styles in speech using neural networks, paired data, and large language models.


Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net

Zizhao Hu,Shaochong Jia,Mohammad Rostami

http://arxiv.org/abs/2311.16488v1

Compressor summary: The paper introduces PS-U-Net, an efficient multimodal diffusion model that preserves modality-specific details and a new multimodal sampling method for conditional generation of text and image data with higher quality.


A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis

Zixiang Zhou,Yu Wan,Baoyuan Wang

http://arxiv.org/abs/2311.16471v1

Compressor summary: The paper presents a scalable method to generate multimodal and multi-part human motion using codebooks and pre-trained models.


AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

Zixiang Zhou,Yu Wan,Baoyuan Wang

http://arxiv.org/abs/2311.16468v1

Compressor summary: AvatarGPT is an all-in-one framework that uses a large language model to perform various motion-related tasks, such as understanding, planning, and generating human motions, by treating each task as an instruction fine-tuned on the shared model.


TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Jingye Chen,Yupan Huang,Tengchao Lv,Lei Cui,Qifeng Chen,Furu Wei

http://arxiv.org/abs/2311.16465v1

Compressor summary: TextDiffuser-2 is a method that uses a large language model to improve the flexibility, automation, and style diversity of text rendering in diffusion models.


Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Yicheng Xiao,Zhuoyan Luo,Yong Liu,Yue Ma,Hengwei Bian,Yatai Ji,Yujiu Yang,Xiu Li

http://arxiv.org/abs/2311.16464v1

Compressor summary: UVCOM is a framework that effectively combines Video Moment Retrieval and Highlight Detection tasks by integrating multi-granularity, intra and inter-modality, and multi-aspect contrastive learning.


Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information

Jie Li,Zhixin Li,Zhi Liu,Pengyuan Zhou,Richang Hong,Qiyue Li,Han Hu

http://arxiv.org/abs/2311.16462v1

Compressor summary: The paper proposes a novel method for improving viewport prediction in volumetric video streaming using saliency detection, trajectory information, and a new sampling technique.


Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

Gourav Datta,Zeyu Liu,Anni Li,Peter A. Beerel

http://arxiv.org/abs/2311.16456v1

Compressor summary: The paper proposes a new training framework that adapts the number of time steps for each module in vision transformers, resulting in energy-efficient spiking neural networks for image recognition tasks.


Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Harsha Nori,Yin Tat Lee,Sheng Zhang,Dean Carignan,Richard Edgar,Nicolo Fusi,Nicholas King,Jonathan Larson,Yuanzhi Li,Weishung Liu,Renqian Luo,Scott Mayer McKinney,Robert Osazuwa Ness,Hoifung Poon,Tao Qin,Naoto Usuyama,Chris White,Eric Horvitz

http://arxiv.org/abs/2311.16452v1

Compressor summary: The authors explore prompt engineering techniques with GPT-4 to unlock its specialist capabilities in various domains, achieving state-of-the-art results on medical benchmarks and outperforming specialist models.


Typhoon Intensity Prediction with Vision Transformer

Huanxin Chen,Pengshuai Yin,Huichou Huang,Qingyao Wu,Ruirui Liu,Xiatian Zhu

http://arxiv.org/abs/2311.16450v1

Compressor summary: The Typhoon Intensity Transformer (Tint) uses self-attention mechanisms to capture local and global contextual relations in satellite images, improving typhoon intensity prediction accuracy.
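The self-attention mechanism the summary refers to is the standard transformer building block. A minimal NumPy sketch of scaled dot-product self-attention over patch tokens (shapes are illustrative, not Tint's actual configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of patch tokens.

    X: (n_tokens, d_model); Wq/Wk/Wv: (d_model, d_head) projections.
    Every token attends to every other token, giving the global receptive
    field that lets a ViT relate distant parts of a satellite image.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 32))                 # e.g. 16 patch embeddings
Wq, Wk, Wv = (rng.normal(size=(32, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

The attention matrix `attn` captures exactly the "local and global contextual relations" the summary mentions: each row shows how much one patch draws on every other patch.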


Centre Stage: Centricity-based Audio-Visual Temporal Action Detection

Hanyuan Wang,Majid Mirmehdi,Dima Damen,Toby Perrett

http://arxiv.org/abs/2311.16446v1

Compressor summary: The paper proposes a novel method to improve one-stage action detection by fusing visual and audio modalities, using multi-scale cross-attention and a centricity score that estimates the proximity of timesteps to the action centre, achieving state-of-the-art performance on EPIC-Kitchens-100.


CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models

Yichao Cai,Yuhang Liu,Zhen Zhang,Javen Qinfeng Shi

http://arxiv.org/abs/2311.16445v1

Compressor summary: The study proposes a method to improve vision-language models' resilience against perturbations by modifying text data's style while preserving its content, without retraining the image encoder on adversarial examples.


Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Takehiko Ohkawa,Takuma Yagi,Taichi Nishimura,Ryosuke Furuta,Atsushi Hashimoto,Yoshitaka Ushiku,Yoichi Sato

http://arxiv.org/abs/2311.16444v1

Compressor summary: The paper introduces a novel benchmark for transferring knowledge from exocentric web videos to dense video captioning of egocentric videos using adversarial training, addressing the challenge of dynamic view changes between these two domains.


Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization

Jinhao Li,Shiyao Li,Jiaming Xu,Shan Huang,Yaoxiu Lian,Jun Liu,Yu Wang,Guohao Dai

http://arxiv.org/abs/2311.16442v1

Compressor summary: The paper proposes techniques to improve the accuracy and efficiency of large language models by adjusting the bit width of quantization and optimizing dequantization operations on GPUs.


Text-Driven Image Editing via Learnable Regions

Yuanze Lin,Yi-Wen Chen,Yi-Hsuan Tsai,Lu Jiang,Ming-Hsuan Yang

http://arxiv.org/abs/2311.16432v1

Compressor summary: The paper presents a text-to-image editing method that uses bounding boxes to find edit regions based on textual prompts, achieving high fidelity and realism with complex prompts.


Manifold Preserving Guided Diffusion

Yutong He,Naoki Murata,Chieh-Hsin Lai,Yuhta Takida,Toshimitsu Uesaka,Dongjun Kim,Wei-Hsiang Liao,Yuki Mitsufuji,J. Zico Kolter,Ruslan Salakhutdinov,Stefano Ermon

http://arxiv.org/abs/2311.16424v1

Compressor summary: MPGD is a training-free method for conditional image generation that uses diffusion models, neural networks, and pretrained autoencoders, achieving notable speed-ups while maintaining high sample quality.


CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models

Yuhang Wang,Yanxu Zhu,Chao Kong,Shuyu Wei,Xiaoyuan Yi,Xing Xie,Jitao Sang

http://arxiv.org/abs/2311.16421v1

Compressor summary: The paragraph introduces CDEval, a new benchmark to evaluate cultural dimensions of Large Language Models (LLMs), emphasizing the importance of cultural considerations in their development and applications.


Model-free Test Time Adaptation for Out-Of-Distribution Detection

YiFan Zhang,Xue Wang,Tian Zhou,Kun Yuan,Zhang Zhang,Liang Wang,Rong Jin,Tieniu Tan

http://arxiv.org/abs/2311.16420v1

Compressor summary: The Non-Parametric Test Time Adaptation framework for Out-Of-Distribution Detection (NPTTA) is a method that adapts to changing data distributions during testing and uses detected out-of-distribution samples to improve reliability, achieving better performance than existing methods.


Deep Learning for Time Series Classification of Parkinson's Disease Eye Tracking Data

Gonzalo Uribarri,Simon Ekman von Huth,Josefine Waldthaler,Per Svenningsson,Erik Fransén

http://arxiv.org/abs/2311.16381v1

Compressor summary: The authors investigate using deep learning algorithms to classify Parkinson's disease from eye-tracking data during saccade experiments, achieving high accuracy with raw fixation interval inputs.