arxiv compressed, 2023-11-30

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2023-11-30 generated by the compressor, my personal LLM-based project.


Dataset Distillation in Large Data Era

Zeyuan Yin,Zhiqiang Shen

http://arxiv.org/abs/2311.18838v1

Compressor summary: The text describes a new method called CDA that improves the accuracy of distilling large datasets like ImageNet-1K and 21K at standard resolution, outperforming previous approaches and reducing the gap to full-data training.


TrafficMOT: A Challenging Dataset for Multi-Object Tracking in Complex Traffic Scenarios

Lihao Liu,Yanqi Cheng,Zhongying Deng,Shujun Wang,Dongdong Chen,Xiaowei Hu,Pietro Liò,Carola-Bibiane Schönlieb,Angelica Aviles-Rivero

http://arxiv.org/abs/2311.18839v1

Compressor summary: The text introduces TrafficMOT, a diverse and complex dataset for multi-object tracking in traffic videos, which can improve traffic monitoring accuracy and road safety by testing advanced machine learning algorithms.


Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living

Dominick Reilly,Srijan Das

http://arxiv.org/abs/2311.18840v1

Compressor summary: The PI-ViT (or $\pi$-ViT) is a new approach that improves video transformers for human action recognition in Activities of Daily Living by adding 2D and 3D pose information to RGB images, achieving state-of-the-art performance without poses or extra computation during inference.


PoseGPT: Chatting about 3D Human Pose

Yao Feng,Jing Lin,Sai Kumar Dwivedi,Yu Sun,Priyanka Patel,Michael J. Black

http://arxiv.org/abs/2311.18836v1

Compressor summary: PoseGPT is a framework that uses large language models to understand and generate 3D human poses from images or text descriptions, enabling advanced tasks like speculative pose generation and reasoning about pose estimation.


VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models

Zhen Xing,Qi Dai,Zihao Zhang,Hui Zhang,Han Hu,Zuxuan Wu,Yu-Gang Jiang

http://arxiv.org/abs/2311.18837v1

Compressor summary: Video Instruction Diffusion (VIDiff) is a fast and versatile model that edits and enhances long videos based on written instructions, using diffusion models for various video tasks.


InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Rongyao Fang,Shilin Yan,Zhaoyang Huang,Jingqiu Zhou,Hao Tian,Jifeng Dai,Hongsheng Li

http://arxiv.org/abs/2311.18835v1

Compressor summary: InstructSeq is a framework that uses natural language instructions to control diverse vision tasks, enabling more versatile and human-like artificial intelligence.


ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Wenming Weng,Ruoyu Feng,Yanhui Wang,Qi Dai,Chunyu Wang,Dacheng Yin,Zhiyuan Zhao,Kai Qiu,Jianmin Bao,Yuhui Yuan,Chong Luo,Yueyi Zhang,Zhiwei Xiong

http://arxiv.org/abs/2311.18834v1

Compressor summary: ART$\boldsymbol{\cdot}$V is a framework that generates videos frame by frame with diffusion models, which avoids modeling complex long-range motions, preserves high fidelity, and mitigates drifting issues, enabling a variety of generation applications.


Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

Hsin-Ying Lee,Hung-Yu Tseng,Hsin-Ying Lee,Ming-Hsuan Yang

http://arxiv.org/abs/2311.18832v1

Compressor summary: The paper proposes a new pipeline that uses pre-trained text-to-image models to predict pixel-level properties from images, while addressing the domain gap and making the process deterministic.


MotionEditor: Editing Video Motion via Content-Aware Diffusion

Shuyuan Tu,Qi Dai,Zhi-Qi Cheng,Han Hu,Xintong Han,Zuxuan Wu,Yu-Gang Jiang

http://arxiv.org/abs/2311.18830v1

Compressor summary: MotionEditor is a diffusion model for video motion editing that incorporates a content-aware motion adapter into ControlNet to preserve the original background and protagonist appearance while modifying the motion information.


MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

Yanhui Wang,Jianmin Bao,Wenming Weng,Ruoyu Feng,Dacheng Yin,Tao Yang,Jingxu Zhang,Qi Dai,Zhiyuan Zhao,Chunyu Wang,Kai Qiu,Yuhui Yuan,Xiaoyan Sun,Chong Luo,Baining Guo

http://arxiv.org/abs/2311.18829v1

Compressor summary: MicroCinema is a framework that generates coherent text-to-video by dividing the process into two stages and using advanced image models to enhance appearance preservation and motion dynamics.


One-step Diffusion with Distribution Matching Distillation

Tianwei Yin,Michaël Gharbi,Richard Zhang,Eli Shechtman,Frédo Durand,William T. Freeman,Taesung Park

http://arxiv.org/abs/2311.18828v1

Compressor summary: DMD is a method that turns diffusion models into one-step image generators with minimal quality loss by matching distributions using score functions and a regression loss, achieving competitive results compared to other few-step diffusion approaches while being much faster.


Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference

Kaiwen Hou

http://arxiv.org/abs/2311.18826v1

Compressor summary: The paper introduces a new method for causal inference using continuous normalizing flows that improves efficiency, robustness, and geometric properties of parametric submodels.


CAST: Cross-Attention in Space and Time for Video Action Recognition

Dongho Lee,Jongseo Lee,Jinwoo Choi

http://arxiv.org/abs/2311.18825v1

Compressor summary: The CAST model uses RGB input to achieve a balanced spatio-temporal understanding of videos for action recognition and outperforms existing methods on various benchmarks.


Initializing Models with Larger Ones

Zhiqiu Xu,Yanjie Chen,Kirill Vishniakov,Yida Yin,Zhiqiang Shen,Trevor Darrell,Lingjie Liu,Zhuang Liu

http://arxiv.org/abs/2311.18823v1

Compressor summary: Weight selection is a method to initialize smaller neural networks by selecting some weights from a pretrained larger model, improving their performance and reducing training time.
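
Weight selection as described above amounts to copying a sub-tensor of a large pretrained layer into a smaller layer of the same kind. A minimal sketch, assuming a simple "take the leading slice along each axis" rule (an illustrative choice, not necessarily the paper's exact selection criterion):

```python
import numpy as np

def select_weights(large_w, small_shape):
    """Initialize a smaller layer by copying the leading sub-tensor of a
    larger pretrained layer's weight array. The slicing rule here is an
    illustrative assumption, not the paper's exact selection scheme."""
    slices = tuple(slice(0, s) for s in small_shape)
    return large_w[slices].copy()

rng = np.random.default_rng(0)
large = rng.standard_normal((768, 768))    # e.g. a ViT-Base projection matrix
small = select_weights(large, (384, 384))  # init for a ViT-Small-sized layer
assert small.shape == (384, 384)
assert np.array_equal(small, large[:384, :384])
```

The smaller network then trains as usual, starting from these copied weights instead of a random initialization.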


ElasticDiffusion: Training-free Arbitrary Size Image Generation

Moayed Haji-Ali,Guha Balakrishnan,Vicente Ordonez

http://arxiv.org/abs/2311.18822v1

Compressor summary: ElasticDiffusion is a training-free method that allows text-to-image diffusion models to generate images with various sizes by decoupling the generation trajectory into local and global signals, resulting in better image coherence quality compared to other methods.


Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

Kaifeng Lyu,Jikai Jin,Zhiyuan Li,Simon S. Du,Jason D. Lee,Wei Hu

http://arxiv.org/abs/2311.18817v1

Compressor summary: The paper investigates the "grokking" phenomenon in neural networks, where test accuracy stays poor long after the training data is fit perfectly and then improves drastically, and attributes it to a dichotomy of early and late phase implicit biases during the learning process.


IMMA: Immunizing text-to-image Models against Malicious Adaptation

Yijia Zheng,Raymond A. Yeh

http://arxiv.org/abs/2311.18815v1

Compressor summary: The text discusses a new method called IMMA that protects image models from generating harmful or unauthorized content by making them difficult to fine-tune with malicious adaptations.


Is Underwater Image Enhancement All Object Detectors Need?

Yudong Wang,Jichang Guo,Wanru He,Huan Gao,Huihui Yue,Zenan Zhang,Chongyi Li

http://arxiv.org/abs/2311.18814v1

Compressor summary: The authors study how 18 underwater image enhancement algorithms affect the performance of 7 object detectors on underwater object detection tasks, using a total of 133 models.


What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

Raphael Tang,Xinyu Zhang,Jimmy Lin,Ferhan Ture

http://arxiv.org/abs/2311.18812v1

Compressor summary: The study uses a probe to examine sociodemographic biases in large language models' latent representations, even when they refuse to respond.
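
A probe of the kind mentioned above is, in its simplest form, a linear classifier fit on frozen model representations to test what information they encode. A hedged sketch on synthetic data standing in for hidden states (the names and setup are illustrative, not the study's actual probe):

```python
import numpy as np

# Synthetic "hidden states" in which an attribute is linearly encoded.
rng = np.random.default_rng(0)
direction = rng.standard_normal(16)
X = rng.standard_normal((1000, 16))      # stand-in for model representations
y = (X @ direction > 0).astype(float)    # attribute encoded along `direction`

# Least-squares linear probe: regress +/-1 targets on the representations.
w, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)
acc = float(np.mean((X @ w > 0) == (y == 1)))
assert acc > 0.9  # the probe recovers the linearly encoded attribute
```

High probe accuracy indicates the attribute is present in the representations even if the model's text output never surfaces it.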


Convergence of Nonconvex PnP-ADMM with MMSE Denoisers

Chicago Park,Shirin Shoushtari,Weijie Gan,Ulugbek S. Kamilov

http://arxiv.org/abs/2311.18810v1

Compressor summary: The paper explains why PnP-ADMM works well with expansive CNNs by relating them to MMSE denoisers and proximal operators.


FoundPose: Unseen Object Pose Estimation with Foundation Features

Evin Pınar Örnek,Yann Labbé,Bugra Tekin,Lingni Ma,Cem Keskin,Christian Forster,Tomas Hodan

http://arxiv.org/abs/2311.18809v1

Compressor summary: FoundPose is a 6D pose estimation method for unseen rigid objects from a single RGB image that uses DINOv2, a vision foundation model, and performs well on the BOP benchmark.


Pre-registration for Predictive Modeling

Jake M. Hofman,Angelos Chatzimparmpas,Amit Sharma,Duncan J. Watts,Jessica Hullman

http://arxiv.org/abs/2311.18807v1

Compressor summary: The authors explore using pre-registration, a practice from explanatory modeling, to improve reproducibility and reliability in predictive modeling by preventing biased estimates and data-dependent decision-making.


Efficient Baseline for Quantitative Precipitation Forecasting in Weather4cast 2023

Akshay Punjabi,Pablo Izquierdo Ayala

http://arxiv.org/abs/2311.18806v1

Compressor summary: The authors propose a minimalist U-Net model for accurate precipitation forecasting that considers the environmental impact of computational resources.


Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

Qi Cao,Takeshi Kojima,Yutaka Matsuo,Yusuke Iwasawa

http://arxiv.org/abs/2311.18805v1

Compressor summary: The study examines the ability of large language models, especially GPT-4, to handle and recover from scrambled sentences and finds that they can almost perfectly reconstruct original sentences even when all letters within each word are scrambled.
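
The hardest setting studied, scrambling all letters within each word while keeping word order, is easy to reproduce. A minimal sketch (function name and seeding are illustrative):

```python
import random

def scramble_words(sentence, seed=0):
    """Scramble all letters within each word, leaving word order intact."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        letters = list(word)
        rng.shuffle(letters)
        out.append("".join(letters))
    return " ".join(out)

print(scramble_words("large language models handle scrambled text"))
```

Feeding such scrambled input to a model and asking it to recover the original sentence is the recovery task the paper evaluates.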


BIOCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens,Jiaman Wu,Matthew J Thompson,Elizabeth G Campolongo,Chan Hee Song,David Edward Carlyn,Li Dong,Wasila M Dahdul,Charles Stewart,Tanya Berger-Wolf,Wei-Lun Chao,Yu Su

http://arxiv.org/abs/2311.18803v1

Compressor summary: The authors present TreeOfLife-10M, a large dataset of biology images, and BioCLIP, a foundation model for the tree of life that uses computer vision to extract information from these images, achieving superior performance on various biology classification tasks.


Distributed Global Structure-from-Motion with a Deep Front-End

Ayush Baid,John Lambert,Travis Driver,Akshay Krishnan,Hayk Stepanyan,Frank Dellaert

http://arxiv.org/abs/2311.18801v1

Compressor summary: The authors compare global and incremental Structure-from-Motion methods using a modular framework that combines recent developments in feature extraction and matching, but find that SIFT features still perform best for incremental SfM.


X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

Artemis Panagopoulou,Le Xue,Ning Yu,Junnan Li,Dongxu Li,Shafiq Joty,Ran Xu,Silvio Savarese,Caiming Xiong,Juan Carlos Niebles

http://arxiv.org/abs/2311.18799v1

Compressor summary: The paper introduces a cross-modality framework for integrating various modalities without extensive customization, using frozen large language models and collecting high-quality instruction tuning data.


MultiResFormer: Transformer with Adaptive Multi-Resolution Modeling for General Time Series Forecasting

Linfeng Du,Ji Xin,Alex Labach,Saba Zuberi,Maksims Volkovs,Rahul G. Krishnan

http://arxiv.org/abs/2311.18780v1

Compressor summary: MultiResFormer is a new transformer-based model that adapts to different periodicity in time series data and achieves better performance on long-term forecasting tasks compared to existing methods.


Mavericks at BLP-2023 Task 1: Ensemble-based Approach Using Language Models for Violence Inciting Text Detection

Saurabh Page,Sudeep Mangalvedhekar,Kshitij Deshpande,Tanmay Chavan,Sheetal Sonawane

http://arxiv.org/abs/2311.18778v1

Compressor summary: The paper describes a study on detecting violence-inciting texts in Bangla, using BERT-based models and achieving an ensemble F1 score of 0.737.
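
Ensembling several fine-tuned classifiers often just means hard majority voting over their predictions. A generic sketch of that step (the paper's exact ensembling scheme may differ):

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: for each example, return the label most models
    agree on (ties broken by first-seen order)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Hypothetical per-model predictions on three Bangla texts.
model_a = ["violence", "non-violence", "violence"]
model_b = ["violence", "violence", "non-violence"]
model_c = ["non-violence", "non-violence", "non-violence"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['violence', 'non-violence', 'non-violence']
```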


CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

Zineng Tang,Ziyi Yang,Mahmoud Khademi,Yang Liu,Chenguang Zhu,Mohit Bansal

http://arxiv.org/abs/2311.18775v1

Compressor summary: CoDi-2 is an advanced AI system that can understand and generate various types of inputs and outputs, such as text, vision, and audio, by following complex instructions and examples.


Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Rohan Myer Krishnan,Zitian Tang,Zhiqiu Yu,Chen Sun

http://arxiv.org/abs/2311.18773v1

Compressor summary: Spacewalk-18 is a benchmark for evaluating video-language models' ability to learn skills from human demonstrations in long and multimodal spacewalk videos, which challenges current methods.


Online Change Points Detection for Linear Dynamical Systems with Finite Sample Guarantees

Lei Xin,George Chiu,Shreyas Sundaram

http://arxiv.org/abs/2311.18769v1

Compressor summary: The paper proposes an online change point detection method for linear systems with unknown dynamics and temporal correlations that can achieve a pre-specified false alarm bound and provides finite-sample-based guarantees on detection probability and delay.


MLLMs-Augmented Visual-Language Representation Learning

Yanqing Liu,Kai Wang,Wenqi Shao,Ping Luo,Yu Qiao,Mike Zheng Shou,Kaipeng Zhang,Yang You

http://arxiv.org/abs/2311.18765v1

Compressor summary: This paper shows that using multi-modal large language models to generate multiple captions for images improves visual-language representation learning and image-text retrieval performance.


Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

James Seale Smith,Yen-Chang Hsu,Zsolt Kira,Yilin Shen,Hongxia Jin

http://arxiv.org/abs/2311.18763v1

Compressor summary: STAMINA is a novel method that improves text-to-image diffusion models for sequential concept learning by using low-rank attention-masked adapters and customized MLP tokens, achieving better performance on 50-concept landmarks and human faces benchmarks.


Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior?

Aryaman Chobey,Oliver Smith,Anzi Wang,Grusha Prasad

http://arxiv.org/abs/2311.18761v1

Compressor summary: The paper investigates if using a curriculum based on sentence-level surprisal estimates from teacher models trained on the BabyLM dataset can improve linguistic knowledge acquisition in neural language models, but finds that it does not result in better alignment with human behavior.


TaskBench: Benchmarking Large Language Models for Task Automation

Yongliang Shen,Kaitao Song,Xu Tan,Wenqi Zhang,Kan Ren,Siyu Yuan,Weiming Lu,Dongsheng Li,Yueting Zhuang

http://arxiv.org/abs/2311.18760v1

Compressor summary: TaskBench is a system for evaluating large language models' ability in task automation by decomposing tasks into sub-tasks, invoking tools, and predicting parameters based on user instructions.


Semi-supervised Semantic Segmentation via Boosting Uncertainty on Unlabeled Data

Daoan Zhang,Yunhao Luo,Jianguo Zhang

http://arxiv.org/abs/2311.18758v1

Compressor summary: The paper proposes a new method for semi-supervised semantic segmentation that boosts uncertainty on unlabeled data to reduce the gap between labeled and unlabeled datasets, improving model generalization and achieving state-of-the-art results.


Language Model Agents Suffer from Compositional Generalization in Web Automation

Hiroki Furuta,Yutaka Matsuo,Aleksandra Faust,Izzeddin Gur

http://arxiv.org/abs/2311.18751v1

Compressor summary: The paper introduces CompWoB, a benchmark that tests language model agents on compositional web automation tasks, showing that their performance degrades when base tasks are combined or their instruction order changes.


TransCORALNet: A Two-Stream Transformer CORAL Networks for Supply Chain Credit Assessment Cold Start

Jie Shi,Arno P. J. M. Siebes,Siamak Mehrkanoon

http://arxiv.org/abs/2311.18749v1

Compressor summary: The paper presents a novel interpretable two-stream transformer model, TransCORALNet, for accurate supply chain credit assessment under segment industry and cold start problems using domain adaptation and LIME explanations.


A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection

Uchechukwu F. Njoku,Alberto Abelló,Besim Bilalli,Gianluca Bontempi

http://arxiv.org/abs/2311.18746v1

Compressor summary: The paper introduces a new method to help data scientists choose the best feature subset from many options by combining visualization and post-processing techniques for MOFS outcomes.


AlignBench: Benchmarking Chinese Alignment of Large Language Models

Xiao Liu,Xuanyu Lei,Shengyuan Wang,Yue Huang,Zhuoer Feng,Bosi Wen,Jiale Cheng,Pei Ke,Yifan Xu,Weng Lam Tam,Xiaohan Zhang,Lichao Sun,Hongning Wang,Jing Zhang,Minlie Huang,Yuxiao Dong,Jie Tang

http://arxiv.org/abs/2311.18743v1

Compressor summary: The paragraph introduces AlignBench, a benchmark for evaluating Chinese LLMs' alignment that uses human-in-the-loop data curation and a companion evaluator LLM called CritiqueLLM.


Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach

Vedant Deshpande,Yash Patwardhan,Kshitij Deshpande,Sudeep Mangalvedhekar,Ravindra Murumkar

http://arxiv.org/abs/2311.18739v1

Compressor summary: The paper presents a method for country-level Arabic dialect identification using pre-trained transformer models and ensembling, achieving 76.65 F1-score at the NADI 2023 Shared Task.


Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks

Suman Sapkota,Binod Bhattarai

http://arxiv.org/abs/2311.18735v1

Compressor summary: The paper explores how different neural architectures can be improved by using dimension mixing techniques inspired by the Fast Fourier Transform and proposes new non-linear mixers for CNNs, Transformers, and MLP-Mixers.


Mavericks at ArAIEval Shared Task: Towards a Safer Digital Space -- Transformer Ensemble Models Tackling Deception and Persuasion

Sudeep Mangalvedhekar,Kshitij Deshpande,Yash Patwardhan,Vedant Deshpande,Ravindra Murumkar

http://arxiv.org/abs/2311.18730v1

Compressor summary: The paper describes an approach for two Arabic AI tasks related to persuasion and disinformation detection using pre-trained transformer models and ensembling, achieving top ranks on both tasks.


Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

Yu Deng,Duomin Wang,Xiaohang Ren,Xingyu Chen,Baoyuan Wang

http://arxiv.org/abs/2311.18729v1

Compressor summary: The method learns one-shot 4D head synthesis from monocular videos using large-scale synthetic data generated by a 4D generative model and a transformer-based reconstructor, with a novel learning strategy for better generalization to real images.


Steering Deep Feature Learning with Backward Aligned Feature Updates

Lénaïc Chizat,Praneeth Netrapalli

http://arxiv.org/abs/2311.18718v1

Compressor summary: The paper proposes a way to predict, measure, and control feature learning in deep learning by aligning feature updates with the backward pass, and studies its implications on hyperparameter tuning and neural network architectures.


CoRec: An Easy Approach for Coordination Recognition

Qing Wang,Haojie Jia,Wenfei Song,Qi Li

http://arxiv.org/abs/2311.18712v1

Compressor summary: The paper presents a new model called CoRec, which uses two components to identify coordinators and conjunct boundaries in sentences more effectively and efficiently than existing methods that rely on syntactic parsers.


Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling

Matúš Pikuliak,Andrea Hrckova,Stefan Oresko,Marián Šimko

http://arxiv.org/abs/2311.18711v1

Compressor summary: GEST is a new dataset for evaluating how well AI systems understand gender stereotypes across 9 Slavic languages and English, finding widespread stereotypical reasoning.


Meta-Prior: Meta learning for Adaptive Inverse Problem Solvers

Matthieu Terris,Thomas Moreau

http://arxiv.org/abs/2311.18710v1

Compressor summary: The paper proposes a meta-learning approach for imaging inverse problems that can handle unsupervised settings, fine-tune to specific tasks, and recover the Bayes optimal estimator with few fine-tuning steps.


Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Daniel Jarne Ornia,Giannis Delimpaltadakis,Jens Kober,Javier Alonso-Mora

http://arxiv.org/abs/2311.18703v1

Compressor summary: PA-RL is a method that makes RL agents more predictable by using state sequence entropy rate as a measure and applying policy-dependent and action-dependent rewards based on entropy.
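
The entropy rate used as a predictability measure above can be estimated empirically from a state sequence under a first-order Markov assumption: weight each state's conditional next-state entropy by how often that state occurs. A hedged sketch of that estimator (not the paper's exact formulation):

```python
import math
from collections import Counter, defaultdict

def empirical_entropy_rate(states):
    """First-order Markov estimate of a state sequence's entropy rate,
    H = sum_s p(s) * H(next | s), in bits per step."""
    transitions = defaultdict(Counter)
    for s, s_next in zip(states, states[1:]):
        transitions[s][s_next] += 1
    n = len(states) - 1
    h = 0.0
    for s, nexts in transitions.items():
        total = sum(nexts.values())
        p_s = total / n
        h_cond = -sum((c / total) * math.log2(c / total) for c in nexts.values())
        h += p_s * h_cond
    return h

# A deterministic cycle is perfectly predictable: entropy rate 0.
assert empirical_entropy_rate([0, 1, 0, 1, 0, 1, 0, 1]) == 0.0
```

Lower values mean more predictable trajectories, which is the quantity such a method would penalize in its reward shaping.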


CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

Pei Ke,Bosi Wen,Zhuoer Feng,Xiao Liu,Xuanyu Lei,Jiale Cheng,Shengyuan Wang,Aohan Zeng,Yuxiao Dong,Hongning Wang,Jie Tang,Minlie Huang

http://arxiv.org/abs/2311.18702v1

Compressor summary: The authors propose CritiqueLLM, a new critique generation model that uses dialogue-based prompting for high-quality evaluation data and shows promising scaling properties compared to GPT-4 in evaluating large language models.


Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

Cheng Sun,Wei-En Tai,Yu-Lin Shih,Kuan-Wei Chen,Yong-Jing Syu,Kent Selwyn The,Yu-Chiang Frank Wang,Hwann-Tzong Chen

http://arxiv.org/abs/2311.18695v1

Compressor summary: The paper introduces Seg2Reg, a method that combines 1D regression and 2D segmentation for room layout reconstruction using density fields and volume rendering, improving accuracy and generalization with a strong baseline model.


Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning

Jared Markowitz,Jesse Silverberg,Gary Collins

http://arxiv.org/abs/2311.18684v1

Compressor summary: The paper proposes new off-policy actor-critic methods for reinforcement learning with mixed-sign rewards, which improve sample efficiency and performance over existing approaches.


RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

Chantal Pellegrini,Ege Özsoy,Benjamin Busam,Nassir Navab,Matthias Keicher

http://arxiv.org/abs/2311.18681v1

Compressor summary: RaDialog is a conversational AI tool that generates accurate radiology reports for medical images and can interactively answer questions or correct errors, advancing the field of radiology.


Cascaded Interaction with Eroded Deep Supervision for Salient Object Detection

Hewen Xiao,Jie Mei,Guangfu Ma,Weiren Wu

http://arxiv.org/abs/2311.18675v1

Compressor summary: The article proposes a novel network structure and a deep supervision strategy to address information distortion caused by interpolation in deep convolutional neural networks for salient object detection.


Action Recognition in Video Recordings from Gynecologic Laparoscopy

Sahar Nasirihaghighi,Negin Ghamsarian,Daniela Stefanics,Klaus Schoeffmann,Heinrich Husslein

http://arxiv.org/abs/2311.18666v1

Compressor summary: The authors propose a new network and framework for recognizing actions in laparoscopic surgeries, which handle challenges such as content distortion, duration variation, and scene variations using recurrent layers and frame sampling.


Pose Estimation and Tracking for ASIST

Ari Goodman,Gurpreet Singh,Ryan O'Shea,Peter Teague,James Hing

http://arxiv.org/abs/2311.18665v1

Compressor summary: The ASIST system for safely arresting helicopters on ships was improved by a research project called PETA, which developed a computer vision prototype that can track helicopters in real time without hardware installation requirements.


Multi-task learning with cross-task consistency for improved depth estimation in colonoscopy

Pedro Esteban Chavarrias Solano,Andrew Bulpitt,Venkataraman Subramanian,Sharib Ali

http://arxiv.org/abs/2311.18664v1

Compressor summary: The authors propose a multi-task learning approach using surface normal prediction and attention mechanisms to improve depth estimation in colonoscopy videos.


Learning Part Segmentation from Synthetic Animals

Jiawei Peng,Ju He,Prakhar Kaushik,Zihao Xiao,Jiteng Mu,Alan Yuille

http://arxiv.org/abs/2311.18661v1

Compressor summary: The paper presents a method to learn part segmentation from synthetic animals using SMAL models, improves domain adaptation with CB-FDM, and shows transferability across quadrupeds in PartImageNet.


ArcMMLU: A Library and Information Science Benchmark for Large Language Models

Shitou Zhang,Zuchao Li,Xingshen Liu,Liming Yang,Ping Wang

http://arxiv.org/abs/2311.18658v1

Compressor summary: The paper introduces ArcMMLU, a benchmark to evaluate large language models' knowledge and reasoning in the Chinese Library & Information Science domain.


Detailed Human-Centric Text Description-Driven Large Scene Synthesis

Gwanghyun Kim,Dong Un Kang,Hoigi Seo,Hayeon Kim,Se Young Chun

http://arxiv.org/abs/2311.18654v1

Compressor summary: DetText2Scene is a novel method for generating large-scale images from detailed human-centric text descriptions with high faithfulness, controllability, and naturalness in a global context.


LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Sijin Chen,Xin Chen,Chi Zhang,Mingsheng Li,Gang Yu,Hao Fei,Hongyuan Zhu,Jiayuan Fan,Tao Chen

http://arxiv.org/abs/2311.18651v1

Compressor summary: The paper introduces LL3DA, a language model that can understand and interact with point cloud 3D scenes directly, improving human-machine communication in complex environments.


Simple Semantic-Aided Few-Shot Learning

Hai Zhang,Junzhe Xu,Shanlin Jiang,Zhenan He

http://arxiv.org/abs/2311.18649v1

Compressor summary: The paper proposes Semantic Evolution to generate high-quality semantics for few-shot learning and shows that a simple two-layer network with these semantics outperforms previous methods.


Stochastic Vision Transformers with Wasserstein Distance-Aware Attention

Franciskus Xaverius Erick,Mina Rezaei,Johanna Paula Müller,Bernhard Kainz

http://arxiv.org/abs/2311.18645v1

Compressor summary: The authors propose a novel stochastic vision transformer that incorporates uncertainty and distance awareness into self-supervised learning pipelines using Wasserstein distance-based attention and regularization, leading to improved performance on various tasks.


Exploring the hierarchical structure of human plans via program generation

Carlos G. Correa,Sophia Sanborn,Mark K. Ho,Frederick Callaway,Nathaniel D. Daw,Thomas L. Griffiths

http://arxiv.org/abs/2311.18644v1

Compressor summary: The paper studies how people create hierarchical plans using a programming task, and finds that humans prefer reusable programs over shorter ones.


DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

Tobias Kirschstein,Simon Giebenhain,Matthias Nießner

http://arxiv.org/abs/2311.18635v1

Compressor summary: The authors propose DiffusionAvatars, a diffusion-based neural renderer that creates high-fidelity 3D head avatars of people with intuitive control over pose and expression using a neural parametric head model, cross-attention, and TriPlane lookup.


A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

Yau Shing Jonathan Cheung,Xi Chen,Lihe Yang,Hengshuang Zhao

http://arxiv.org/abs/2311.18628v1

Compressor summary: The paper proposes a lightweight clustering method using self-supervised vision transformer features for unsupervised semantic segmentation, achieving state-of-the-art results on two datasets.


JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

Shishir Muralidhara,Sravan Kumar Jagadeesh,René Schuster,Didier Stricker

http://arxiv.org/abs/2311.18618v1

Compressor summary: The paper introduces Joint Panoptic Part Fusion (JPPF), a method to combine semantic areas, object instances, and semantic parts in computer vision, which is evaluated on two datasets and shows fair fusion and generalization without fine-tuning.


Anatomy and Physiology of Artificial Intelligence in PET Imaging

Tyler J. Bradshaw,Alan B. McMillan

http://arxiv.org/abs/2311.18614v1

Compressor summary: The article aims to educate readers about AI principles, focusing on aspects relevant to PET imaging using examples like convolutional neural networks and U-Net.


DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Daoyi Gao,Dávid Rozenberszki,Stefan Leutenegger,Angela Dai

http://arxiv.org/abs/2311.18610v1

Compressor summary: DiffCAD is a weakly-supervised probabilistic method that learns to reconstruct 3D objects from RGB images using diffusion and multiple plausible CAD models, achieving competitive performance even on real data without supervision.


ArthModel: Enhance Arithmetic Skills to Large Language Model

Yingdi Guo

http://arxiv.org/abs/2311.18609v1

Compressor summary: The paper presents a method to improve the arithmetic capabilities of large language models by combining them with small pretrained models and prompt injection, addressing limitations like toxicity and poor performance.


Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

Hyelin Nam,Gihyun Kwon,Geon Yeong Park,Jong Chul Ye

http://arxiv.org/abs/2311.18608v1

Compressor summary: The Contrastive Denoising Score (CDS) technique improves image editing by preserving structural details and transforming content in latent diffusion models using intermediate features from self-attention layers.


Learning Triangular Distribution in Visual World

Ping Chen,Xingpeng Zhang,Chengtao Zhou,Dichao Fan,Peng Tu,Le Zhang,Yanlin Qian

http://arxiv.org/abs/2311.18605v1

Compressor summary: The authors propose a method called Triangular Distribution Transform (TDT) to map feature discrepancies to label discrepancies in convolutional neural networks for label distribution learning tasks, improving performance and correctness.


Generalisable Agents for Neural Network Optimisation

Kale-ab Tessera,Callum Rhys Tilbury,Sasha Abramowitz,Ruan de Kock,Omayma Mahjoub,Benjamin Rosman,Sara Hooker,Arnu Pretorius

http://arxiv.org/abs/2311.18598v1

Compressor summary: GANNO is a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimization by dynamically scheduling hyperparameters at a layerwise level, using one agent per layer.


Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models

Dong Li,Jiandong Jin,Yuhao Zhang,Yanlin Zhong,Yaoyang Wu,Lan Chen,Xiao Wang,Bin Luo

http://arxiv.org/abs/2311.18592v1

Compressor summary: The study presents a novel pattern recognition framework that fuses RGB frames, event streams, and semantic labels using large-scale vision-language models like CLIP.


Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks

Juyoung Yun

http://arxiv.org/abs/2311.18587v1

Compressor summary: The study proposes using 16-bit precision for ongoing training of 32-bit deep learning models, which improves speed without sacrificing accuracy and reduces computational resources.
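
The core of the precision switch above is just casting float32 parameters to float16, which halves their memory footprint at the cost of a small rounding error. A minimal sketch of that trade-off (illustrative of the cast only, not the paper's training loop):

```python
import numpy as np

# Cast pretrained float32 weights to float16: half the bytes, tiny rounding
# error for unit-scale values.
rng = np.random.default_rng(0)
w32 = rng.standard_normal((1024, 1024)).astype(np.float32)
w16 = w32.astype(np.float16)

assert w16.nbytes == w32.nbytes // 2
max_err = float(np.max(np.abs(w32 - w16.astype(np.float32))))
assert max_err < 1e-2  # rounding error stays small for N(0, 1)-scale weights
```

Continuing training in the lower precision then also speeds up arithmetic on hardware with native float16 support.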


FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

Shiyao Cui,Zhenyu Zhang,Yilong Chen,Wenyuan Zhang,Tianyun Liu,Siqi Wang,Tingwen Liu

http://arxiv.org/abs/2311.18580v1

Compressor summary: The paper introduces FFT, a new benchmark to evaluate the potential harms of large language models based on factuality, fairness, and toxicity.


Fingerprint Matching with Localized Deep Representation

Yongjie Duan,Zhiyu Pan,Jianjiang Feng,Jie Zhou

http://arxiv.org/abs/2311.18576v1

Compressor summary: The paper introduces LDRF, a flexible and accurate fixed-length fingerprint representation that uses localized deep learning to handle different visible areas and poses, and proposes a matching score normalization technique to reduce false matches in large databases.


Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations

Yuli Slavutsky,Yuval Benjamini

http://arxiv.org/abs/2311.18575v1

Compressor summary: The paper proposes an algorithm for zero-shot classifiers to handle distribution shifts in unseen classes by using hierarchical data sampling and out-of-distribution generalization.


Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

Avijit Dasgupta,C. V. Jawahar,Karteek Alahari

http://arxiv.org/abs/2311.18572v1

Compressor summary: The paper proposes a self-training based source-free video domain adaptation method that handles noisy labels and uses a teacher-student framework to improve performance on target domain videos.


Grammatical Gender's Influence on Distributional Semantics: A Causal Perspective

Karolina Stańczak,Kevin Du,Adina Williams,Isabelle Augenstein,Ryan Cotterell

http://arxiv.org/abs/2311.18567v1

Compressor summary: The paragraph discusses a study that challenges the neo-Whorfian hypothesis by showing that grammatical gender has little to no impact on how people choose adjectives for inanimate nouns when controlling for meaning.


Seam-guided local alignment and stitching for large parallax images

Tianli Liao,Chenyang Zhao,Lei Li,Heling Cao

http://arxiv.org/abs/2311.18564v1

Compressor summary: The paper presents a local alignment and stitching method that improves image quality by evaluating seam quality and adjusting pixel regions with low quality.


Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering

Yurui Chen,Chun Gu,Junzhe Jiang,Xiatian Zhu,Li Zhang

http://arxiv.org/abs/2311.18561v1

Compressor summary: The Periodic Vibration Gaussian model uses a 3D Gaussian splatting technique with periodic vibrations to represent dynamic urban scenes and outperforms existing methods without relying on object labels or optical flow estimation.


Can semi-supervised learning use all the data effectively? A lower bound perspective

Alexandru Ţifrea,Gizem Yüce,Amartya Sanyal,Fanny Yang

http://arxiv.org/abs/2311.18557v1

Compressor summary: The paragraph discusses the limitations and potential of semi-supervised learning (SSL) algorithms in improving over supervised learning (SL) and unsupervised learning (UL) methods, using 2-Gaussian mixture models as an example.


Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions

Daniel Grimm,Maximilian Zipfl,Felix Hertlein,Alexander Naumann,Jürgen Lüttin,Steffen Thoma,Stefan Schmid,Lavdim Halilaj,Achim Rettinger,J. Marius Zöllner

http://arxiv.org/abs/2311.18553v1

Compressor summary: The paragraph describes a new vector-based approach for predicting traffic trajectories that improves on existing methods by using a semantic scene graph, image-based map features, and anchor paths to account for agent interactions, context, and constraints.


Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions

Tuomas Jalonen,Mohammad Al-Sa'd,Serkan Kiranyaz,Moncef Gabbouj

http://arxiv.org/abs/2311.18547v1

Compressor summary: The paper presents a real-time CNN model for diagnosing multiple bearing faults under various conditions and compares it to the current state-of-the-art approach, showing significant accuracy gains and robustness to noise.


Match me if you can: Semantic Correspondence Learning with Unpaired Images

Jiwon Kim,Byeongho Heo,Sangdoo Yun,Seungryong Kim,Dongyoon Han

http://arxiv.org/abs/2311.18540v1

Compressor summary: The paper presents a simple method that uses unlabeled pairs to improve semantic correspondence without extra annotations, achieving better results than existing methods.


MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

Ju He,Qihang Yu,Inkyu Shin,Xueqing Deng,Xiaohui Shen,Alan Yuille,Liang-Chieh Chen

http://arxiv.org/abs/2311.18537v1

Compressor summary: MaXTron is a framework that uses Mask XFormer with Trajectory Attention for panoptic segmentation, enhancing temporal consistency with within-clip and cross-clip tracking modules.


Dataset Distillation via the Wasserstein Metric

Haoyang Liu,Tiancheng Xing,Luwei Li,Vibhu Dalal,Jingrui He,Haohan Wang

http://arxiv.org/abs/2311.18531v1

Compressor summary: The paper proposes a new dataset distillation method using Wasserstein distance to match synthetic data with extensive datasets, achieving better performance on several benchmarks.
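
As an illustrative sketch (not the paper's implementation), the 1-D Wasserstein distance underlying this line of work can be computed for equal-size empirical samples as the mean absolute difference of the sorted values; the feature arrays below are hypothetical stand-ins:

```python
import numpy as np

def wasserstein_1d(u, v):
    """1-D earth mover's distance between equal-size empirical samples:
    the average absolute difference of the sorted values."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    assert u.shape == v.shape, "equal sample sizes assumed for simplicity"
    return float(np.mean(np.abs(u - v)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)       # stand-in for full-dataset features
synthetic = rng.normal(0.1, 1.0, 1000)  # stand-in for distilled features
print(f"W1 distance: {wasserstein_1d(real, synthetic):.4f}")
```

A distillation method in this spirit would minimize such a distance between synthetic and real feature distributions rather than just print it.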


HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers

Maciej Besta,Afonso Claudino Catarino,Lukas Gianinazzi,Nils Blach,Piotr Nyczyk,Hubert Niewiadomski,Torsten Hoefler

http://arxiv.org/abs/2311.18526v1

Compressor summary: The HOT model improves dynamic link prediction by using higher-order graph structures and hierarchy in attention matrices, achieving high accuracy with low memory usage.


Combining deep generative models with extreme value theory for synthetic hazard simulation: a multivariate and spatially coherent approach

Alison Peard,Jim Hall

http://arxiv.org/abs/2311.18521v1

Compressor summary: The authors propose a new method using GANs to simulate realistic compound hazards from climate risk data, which can help with climate adaptation and disaster preparedness.


Color-Emotion Associations in Art: Fuzzy Approach

Pakizar Shamoi,Muragul Muratbekova

http://arxiv.org/abs/2311.18518v1

Compressor summary: The paper presents a fuzzy set approach to classify emotions in paintings using color associations, which correlates well with human judgments and has potential applications in various fields.


Revisiting Proposal-based Object Detection

Aritra Bhowmik,Martin R. Oswald,Pascal Mettes,Cees G. M. Snoek

http://arxiv.org/abs/2311.18512v1

Compressor summary: The paper proposes a simpler and more effective alternative for detecting objects in images by regressing to intersections between proposals and ground truth boxes, instead of overlapping areas.


Accurate Segmentation of Optic Disc And Cup from Multiple Pseudo-labels by Noise-Aware Learning

Tengjin Weng,Yang Shen,Zhidong Zhao,Zhiming Cheng,Shuai Wang

http://arxiv.org/abs/2311.18496v1

Compressor summary: The proposed method uses multiple initialization networks to generate pseudo-labels and consensus information to denoise optic disc and cup segmentation data for better glaucoma screening and diagnosis.


Improving Adversarial Transferability via Model Alignment

Avery Ma,Amir-massoud Farahmand,Yangchen Pan,Philip Torr,Jindong Gu

http://arxiv.org/abs/2311.18495v1

Compressor summary: The paper proposes a method to make neural networks better at generating adversarial perturbations that work across different models by fine-tuning the source model using a witness model, and shows improved transferability in experiments.


ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs

Violeta Menéndez González,Andrew Gilbert,Graeme Phillipson,Stephen Jolly,Simon Hadfield

http://arxiv.org/abs/2311.18491v1

Compressor summary: ZeST-NeRF is a novel approach that can generate temporal NeRFs for new scenes without retraining, using multi-view synthesis techniques and scene flow-field estimation, achieving improved visual and quantitative results.


Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi,Miao Wang,Hao-Bin Duan,Shao-Hua Guan

http://arxiv.org/abs/2311.18482v1

Compressor summary: The text introduces Language Embedded 3D Gaussians, a new way to represent scenes for open-vocabulary query tasks that uses less memory and performs better than previous approaches.


ESG Accountability Made Easy: DocQA at Your Service

Lokesh Mishra,Cesar Berrospi,Kasper Dinkla,Diego Antognini,Francesco Fusco,Benedikt Bothur,Maksym Lysak,Nikolaos Livathinos,Ahmed Nassar,Panagiotis Vagenas,Lucas Morin,Christoph Auer,Michele Dolfi,Peter Staar

http://arxiv.org/abs/2311.18481v1

Compressor summary: Deep Search DocQA is a conversational AI system that helps users extract information from ESG reports using computer vision, NLP, and language models.


Use of explicit replies as coordination mechanisms in online student debate

Bruno D. Ferreira-Saraiva,Joao P. Matos-Carvalho,Manuel Pita

http://arxiv.org/abs/2311.18466v1

Compressor summary: The text discusses how explicit replies in computer-mediated communication affect the structure of conversations and how to identify roles of utterances using a hierarchical topic model.


Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework

Maresa Schröder,Dennis Frauen,Stefan Feuerriegel

http://arxiv.org/abs/2311.18460v1

Compressor summary: This paper studies how unobserved confounding affects causal fairness in machine learning and proposes a new neural framework to learn fair predictions despite this challenge.


How Much Is Hidden in the NAS Benchmarks? Few-Shot Adaptation of a NAS Predictor

Hrushikesh Loya,Łukasz Dudziak,Abhinav Mehrotra,Royson Lee,Javier Fernandez-Marques,Nicholas D. Lane,Hongkai Wen

http://arxiv.org/abs/2311.18451v1

Compressor summary: The text discusses using meta-learning methods from few-shot adaptation to improve neural architecture search for diverse tasks, with a focus on reducing its cost and uncertainty in under-represented domains.


HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

Zicong Fan,Maria Parelli,Maria Eleni Kadoglou,Muhammed Kocabas,Xu Chen,Michael J. Black,Otmar Hilliges

http://arxiv.org/abs/2311.18448v1

Compressor summary: The paper introduces HOLD, a method that can reconstruct 3D hand and object interactions from monocular videos without relying on pre-scanned templates or limited data, using an implicit model and hand-object constraints.


VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang,Xin Wang,Hong Chen,Zihan Song,Wenwu Zhu

http://arxiv.org/abs/2311.18445v1

Compressor summary: VTimeLLM is a novel Video LLM that uses a three-stage training strategy to improve fine-grained video moment understanding and reasoning with respect to time boundaries, outperforming existing Video LLMs in various tasks.


The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies

Victor Boone

http://arxiv.org/abs/2311.18437v1

Compressor summary: The paper investigates how well no-regret algorithms perform in one-shot stochastic bandits and finds that randomized methods have better sliding regret than index policies.


Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

Zipeng Qi,Guoxi Huang,Zebin Huang,Qin Guo,Jinwen Chen,Junyu Han,Jian Wang,Gang Zhang,Lufei Liu,Errui Ding,Jingdong Wang

http://arxiv.org/abs/2311.18435v1

Compressor summary: The paper presents two innovations for improving spatial controllability in text-based diffusion models, called Vision Guidance and Layered Rendering Diffusion, which lead to more efficient and accurate image synthesis with specific spatial and contextual requirements.


Exploring the Temperature-Dependent Phase Transition in Modern Hopfield Networks

Felix Koulischer,Cédric Goemaere,Tom van der Meersch,Johannes Deleu,Thomas Demeester

http://arxiv.org/abs/2311.18434v1

Compressor summary: The paper investigates how the inverse temperature hyperparameter affects the performance of Modern Hopfield Networks (MHNs) and suggests that understanding it could help optimize Transformers in the future.


E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning

Xiuhong Lin,Changjie Qiu,Zhipeng Cai,Siqi Shen,Yu Zang,Weiquan Liu,Xuesheng Bian,Matthias Müller,Cheng Wang

http://arxiv.org/abs/2311.18433v1

Compressor summary: E2PNet is a novel method that registers 2D RGB images to 3D point clouds using event data, outperforming other methods and being robust to extreme illumination or fast motion.


TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing

Lianrui Mu,Jianhong Bai,Xiaoxuan He,Jiangnan Ye,Xiaoyu Liang,Yuchen Yang,Jiedong Zhuang,Haoji Hu

http://arxiv.org/abs/2311.18420v1

Compressor summary: TeG-DG is a framework that leverages text information to improve the domain generalization of Face Anti-Spoofing techniques, achieving better performance especially with limited source domain data.


CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

Jianhao Zeng,Dan Song,Weizhi Nie,Hongshuo Tian,Tongtong Wang,Anan Liu

http://arxiv.org/abs/2311.18405v1

Compressor summary: CAT-DM is a new method for virtual try-on that combines controllability and acceleration using a diffusion model, outperforming previous methods in image quality and pattern reproduction.


Corrupting Convolution-based Unlearnable Datasets with Pixel-based Image Transformations

Xianlong Wang,Shengshan Hu,Minghui Li,Zhifei Yu,Ziqi Zhou,Leo Yu Zhang,Hai Jin

http://arxiv.org/abs/2311.18403v1

Compressor summary: The authors propose a new image corruption method to defend against a type of unlearnable dataset that uses convolution and random matrices to counteract imperceptible perturbations in training data.


MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

Dan Song,Xinwei Fu,Weizhi Nie,Wenhui Li,Anan Liu

http://arxiv.org/abs/2311.18402v1

Compressor summary: The paper proposes view selection and hierarchical prompts to improve zero-shot 3D shape recognition using language-image pre-trained models like CLIP.


RainAI -- Precipitation Nowcasting from Satellite Data

Rafael Pablos Sarabia,Joachim Nyborg,Morten Birk,Ira Assent

http://arxiv.org/abs/2311.18398v1

Compressor summary: The paper proposes a 2D U-Net model that uses refined satellite images, an improved cross-entropy loss function, and Conditioning Lead Time to forecast high-resolution precipitation more accurately than the official 3D U-Net baseline.


IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions

Zhebin Zhang,Xinyu Zhang,Yuanhang Ren,Saijiang Shi,Meng Han,Yongkang Wu,Ruofei Lai,Zhao Cao

http://arxiv.org/abs/2311.18397v1

Compressor summary: The paper introduces Induction-Augmented Generation (IAG), which uses inductive reasoning to generate implicit knowledge for open-domain QA tasks, improving performance over existing methods.


Data-efficient Deep Reinforcement Learning for Vehicle Trajectory Control

Bernd Frauenknecht,Tobias Ehlgen,Sebastian Trimpe

http://arxiv.org/abs/2311.18393v1

Compressor summary: The paper explores three data-efficient deep RL methods for vehicle trajectory control and proposes a new model-based formulation that improves their performance over standard approaches like soft-actor critic.


On Exact Inversion of DPM-Solvers

Seongmin Hong,Kyeonghyun Lee,Suh Yoon Jeon,Hyewon Bae,Se Young Chun

http://arxiv.org/abs/2311.18387v1

Compressor summary: The paper proposes algorithms for finding the initial noise from images generated by diffusion probabilistic models, improving robustness and quality of image editing tasks.


A Survey on Deep Learning for Polyp Segmentation: Techniques, Challenges and Future Trends

Jiaxin Mei,Tao Zhou,Kaiwen Huang,Yizhe Zhang,Yi Zhou,Ye Wu,Huazhu Fu

http://arxiv.org/abs/2311.18373v1

Compressor summary: This paper reviews polyp segmentation algorithms, comparing traditional methods with deep learning models and evaluating their performance on benchmark datasets.


Hubness Reduction Improves Sentence-BERT Semantic Spaces

Beatrix M. G. Nielsen,Lars Kai Hansen

http://arxiv.org/abs/2311.18364v1

Compressor summary: The authors study Sentence-BERT embeddings and find that they have a problem called hubness, which affects the quality of semantic representations; they propose a combination of two methods to reduce this issue and improve the results.
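
Hubness means a few points appear in many other points' nearest-neighbour lists. A minimal sketch of measuring it via k-occurrence counts (illustrative only; the data here is random, not Sentence-BERT embeddings):

```python
import numpy as np

def k_occurrence(X, k=5):
    """For each point, count how often it appears in other points'
    k-nearest-neighbour lists; a heavily skewed count distribution
    indicates hubness."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # exclude self-neighbours
    knn = np.argsort(D, axis=1)[:, :k]     # k nearest neighbours per point
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # hypothetical high-dim embeddings
counts = k_occurrence(X, k=5)
print("max k-occurrence:", counts.max(), "mean:", counts.mean())
```

Hubness-reduction methods rescale distances or similarities so that these counts become more uniform across points.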


Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation

Ziyang Chen,Yiwen Ye,Mengkang Lu,Yongsheng Pan,Yong Xia

http://arxiv.org/abs/2311.18363v1

Compressor summary: The paper proposes a new method called VPTTA that adapts visual prompts to adjust semantic segmentation models for different medical images without updating the pre-trained model, achieving better results than other methods.


Automating lookahead planning using site appearance and space utilization

Eyob Mengiste,Borja Garcia de Soto,Timo Hartmann

http://arxiv.org/abs/2311.18361v1

Compressor summary: The study presents a method to automatically generate lookahead plans for construction projects using a neural network model that considers material conditions, space utilization, and project timeline.


TIDE: Test Time Few Shot Object Detection

Weikai Li,Hongfeng Wei,Yanlai Wu,Jie Yang,Yudi Ruan,Yuan Li,Ying Tang

http://arxiv.org/abs/2311.18358v1

Compressor summary: TIDE is a novel FSOD method that learns from untuned support instances and uses cross-attention and multi-scale resizing to improve performance, overcoming limitations of existing methods in Industry 5.0 scenarios.


Towards Comparable Active Learning

Thorben Werner,Johannes Burchert,Lars Schmidt-Thieme

http://arxiv.org/abs/2311.18356v1

Compressor summary: The paper presents an Active Learning framework to compare algorithms fairly across tasks and domains, and proposes the first benchmark testing algorithms in Tabular, Image, and Text domains.


Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension

Akira Kawabata,Saku Sugawara

http://arxiv.org/abs/2311.18353v1

Compressor summary: The authors present a dataset to test language models' understanding of the rationale behind critical reasoning in logical reading comprehension tasks, and find that current models struggle to explain why incorrect options should be eliminated.


DSeg: Direct Line Segments Detection

Cyrille Berger,Simon Lacroix

http://arxiv.org/abs/2311.18344v1

Compressor summary: The paper proposes a fast, robust, and parameter-free model-driven method for detecting image line segments using a linear Kalman filter and a pyramidal extension.
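
For intuition, a generic scalar Kalman predict/update step of the kind such a detector could use to track a slowly varying quantity along a growing segment (a hedged sketch, not the paper's code; `q` and `r` are assumed noise parameters):

```python
def kalman_step(x, p, z, q=1e-3, r=1e-1):
    """One scalar Kalman predict/update step for a constant-state model.
    x: state estimate, p: estimate variance, z: new measurement,
    q: process noise variance, r: measurement noise variance."""
    p = p + q                  # predict: uncertainty grows by process noise
    k = p / (p + r)            # Kalman gain: trust in the new measurement
    x = x + k * (z - x)        # update: blend prediction and measurement
    p = (1.0 - k) * p          # updated (reduced) uncertainty
    return x, p

# Track a noisy roughly-constant value, as a detector might track the
# gradient direction while extending a line segment pixel by pixel.
x, p = 0.0, 1.0
for z in [0.9, 1.1, 1.0, 0.95, 1.05]:
    x, p = kalman_step(x, p, z)
print(f"estimate: {x:.3f}, variance: {p:.4f}")
```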


Learning Robust Precipitation Forecaster by Temporal Frame Interpolation

Lu Han,Xu-Yang Chen,Han-Jia Ye,De-Chuan Zhan

http://arxiv.org/abs/2311.18341v1

Compressor summary: This paper proposes TFI, a technique to generate synthetic data from adjacent frames and a multi-level dice loss to improve precipitation forecasting models' robustness against spatial-temporal shifts.


Multilevel Saliency-Guided Self-Supervised Learning for Image Anomaly Detection

Jianjian Qin,Chunzhi Gu,Jun Yu,Chao Zhang

http://arxiv.org/abs/2311.18332v1

Compressor summary: The paper proposes a new anomaly detection method called CutSwap, which uses saliency maps to generate realistic negative samples for self-supervised learning in computer vision.


MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation

Sumanth Udupa,Prajwal Gurunath,Aniruddh Sikdar,Suresh Sundaram

http://arxiv.org/abs/2311.18331v1

Compressor summary: The paper proposes a technique called MRFP to improve semantic scene understanding by randomizing fine-grained and coarse features in deep neural networks, enabling better generalization from simulated data to real-world scenes.


Advances in 3D Neural Stylization: A Survey

Yingshu Chen,Guocheng Shao,Ka Chun Shum,Binh-Son Hua,Sai-Kit Yeung

http://arxiv.org/abs/2311.18328v1

Compressor summary: The paper surveys recent advances in using artificial intelligence to create digital art by transforming 3D data into various styles, and explores its applications and challenges.


Anisotropic Neural Representation Learning for High-Quality Neural Rendering

Y. Wang,J. Xu,Y. Zeng,Y. Gong

http://arxiv.org/abs/2311.18311v1

Compressor summary: The paper proposes a method to improve NeRF's view synthesis by learning anisotropic features that eliminate ambiguity and enhance scene representation, achieving better rendering quality on synthetic and real data.


Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent

Yuxiao Chen,Sander Tonkens,Marco Pavone

http://arxiv.org/abs/2311.18307v1

Compressor summary: CTT is a traffic model that outputs continuous and categorical predictions, has an interpretable latent space, and can integrate with large language models for better autonomous vehicle planning and simulation.


OmniMotionGPT: Animal Motion Generation with Limited Data

Zhangsihao Yang,Mingyuan Zhou,Mengyi Shan,Bingbing Wen,Ziwei Xuan,Mitch Hill,Junjie Bai,Guo-Jun Qi,Yalin Wang

http://arxiv.org/abs/2311.18303v1

Compressor summary: The paper proposes a model to generate diverse and realistic animal motions from text descriptions using knowledge from human motion synthesis and introduces a new dataset with 36 animal identities.


Reconstructing the normal and shape at specularities in endoscopy

Karim Makki,Adrien Bartoli

http://arxiv.org/abs/2311.18299v1

Compressor summary: The paper proposes a new method to use specularities in endoscopic images as cues for 3D perception by reconstructing the tissue's normal direction and shape from a single image.


TrustMark: Universal Watermarking for Arbitrary Resolution Images

Tu Bui,Shruti Agarwal,John Collomosse

http://arxiv.org/abs/2311.18297v1

Compressor summary: TrustMark is a GAN-based watermarking method that balances image quality and watermark recovery accuracy, with robustness to various perturbations and a watermark remover counterpart.


Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Zhiwei Deng,Ting Chen,Yang Li

http://arxiv.org/abs/2311.18296v1

Compressor summary: The paper presents the Perceptual Group Tokenizer, a model that uses perceptual grouping to extract visual features and learn representations without label supervision, achieving competitive performance on ImageNet-1K benchmark.


TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers

Juhyeon Park,Seokhyeon Jeong,Taesup Moon

http://arxiv.org/abs/2311.18291v1

Compressor summary: The paper proposes TLDR, a method to reduce spurious correlations in image classifiers by using texts generated by large language models as proxies for images and filtering noisy words.


CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt

Haiyao Xiao,Chenglai Zhong,Xuan Gao,Yudong Guo,Juyong Zhang

http://arxiv.org/abs/2311.18288v1

Compressor summary: CosAvatar is a framework for portrait tuning that uses monocular video and text inputs to create animatable portraits with temporal and 3D consistency, enabling precise editing of styles and attributes based on text instructions.


SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

Lingyi Hong,Wei Zhang,Shuyong Gao,Hong Lu,WenQiang Zhang

http://arxiv.org/abs/2311.18286v1

Compressor summary: SimulFlow is a novel method for unsupervised video object segmentation that simultaneously extracts features and identifies targets using a SimulFlow Attention mechanism, achieving state-of-the-art performance while addressing computational complexity and fusion difficulties.


HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance

Zhuohao Yin,Xin Huang

http://arxiv.org/abs/2311.18273v1

Compressor summary: The paper introduces a multi-modal framework that uses pretrained models, knowledge bases, and datasets to disambiguate word meanings from images, and shares insights and code for the research community.


Beyond Entropy: Style Transfer Guided Single Image Continual Test-Time Adaptation

Younggeol Cho,Youngrae Kim,Dongman Lee

http://arxiv.org/abs/2311.18270v1

Compressor summary: BESTTA is a novel method that uses style transfer to adapt models to changing environments with limited resources, achieving accuracy and efficiency using only a single image.


Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

Ruxiao Duan,Yaoyao Liu,Jieneng Chen,Adam Kortylewski,Alan Yuille

http://arxiv.org/abs/2311.18266v1

Compressor summary: ESCORT is a novel method for class-incremental learning that compresses old images into prompts and generates diverse exemplars from them using an off-the-shelf diffusion model, improving performance significantly on multiple benchmarks.


MCI Detection using fMRI time series embeddings of Recurrence plots

Ninad Aithal,Chakka Sai Pradeep,Neelam Sinha

http://arxiv.org/abs/2311.18265v1

Compressor summary: The study uses resting state fMRI to analyze brain network dynamics and distinguishes healthy subjects from those with Mild Cognitive Impairment with high accuracy.


Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman,Andrew Westbury,Lorenzo Torresani,Kris Kitani,Jitendra Malik,Triantafyllos Afouras,Kumar Ashutosh,Vijay Baiyya,Siddhant Bansal,Bikram Boote,Eugene Byrne,Zach Chavis,Joya Chen,Feng Cheng,Fu-Jen Chu,Sean Crane,Avijit Dasgupta,Jing Dong,Maria Escobar,Cristhian Forigua,Abrham Gebreselasie,Sanjay Haresh,Jing Huang,Md Mohaiminul Islam,Suyog Jain,Rawal Khirodkar,Devansh Kukreja,Kevin J Liang,Jia-Wei Liu,Sagnik Majumder,Yongsen Mao,Miguel Martin,Effrosyni Mavroudi,Tushar Nagarajan,Francesco Ragusa,Santhosh Kumar Ramakrishnan,Luigi Seminara,Arjun Somayazulu,Yale Song,Shan Su,Zihui Xue,Edward Zhang,Jinxu Zhang,Angela Castillo,Changan Chen,Xinzhu Fu,Ryosuke Furuta,Cristina Gonzalez,Prince Gupta,Jiabo Hu,Yifei Huang,Yiming Huang,Weslie Khoo,Anush Kumar,Robert Kuo,Sach Lakhavani,Miao Liu,Mi Luo,Zhengyi Luo,Brighid Meredith,Austin Miller,Oluwatumininu Oguntola,Xiaqing Pan,Penny Peng,Shraman Pramanick,Merey Ramazanova,Fiona Ryan,Wei Shan,Kiran Somasundaram,Chenan Song,Audrey Southerland,Masatoshi Tateno,Huiyu Wang,Yuchen Wang,Takuma Yagi,Mingfei Yan,Xitong Yang,Zecheng Yu,Shengxin Cindy Zha,Chen Zhao,Ziwei Zhao,Zhifan Zhu,Jeff Zhuo,Pablo Arbelaez,Gedas Bertasius,David Crandall,Dima Damen,Jakob Engel,Giovanni Maria Farinella,Antonino Furnari,Bernard Ghanem,Judy Hoffman,C. V. Jawahar,Richard Newcombe,Hyun Soo Park,James M. Rehg,Yoichi Sato,Manolis Savva,Jianbo Shi,Mike Zheng Shou,Michael Wray

http://arxiv.org/abs/2311.18259v1

Compressor summary: Ego-Exo4D is a large and diverse dataset of multimodal videos with various human activities, contexts, and annotations for research purposes.


Diffusion Models Without Attention

Jing Nathan Yan,Jiatao Gu,Alexander M. Rush

http://arxiv.org/abs/2311.18257v1

Compressor summary: DiffuSSM is a scalable state space model for high-fidelity image generation that preserves detailed images without global compression, offering better performance and lower computational cost than existing models.


Sketch Input Method Editor: A Comprehensive Dataset and Methodology for Systematic Input Recognition

Guangming Zhu,Siyuan Wang,Qing Cheng,Kelong Wu,Hao Li,Liang Zhang

http://arxiv.org/abs/2311.18254v1

Compressor summary: This study introduces SketchIME, a sketch input method for creating situation maps in C4I systems, with a new dataset and recognition architecture that adapts to new users and tasks.


Combined Scheduling, Memory Allocation and Tensor Replacement for Minimizing Off-Chip Data Accesses of DNN Accelerators

Yi Li,Aarti Gupta,Sharad Malik

http://arxiv.org/abs/2311.18246v1

Compressor summary: COSMA is an optimization framework that minimizes additional data accesses in specialized hardware accelerators for Deep Neural Networks.


LLVMs4Protest: Harnessing the Power of Large Language and Vision Models for Deciphering Protests in the News

Yongjun Zhang

http://arxiv.org/abs/2311.18241v1

Compressor summary: The authors fine-tuned two large transformer models, longformer and swin-transformer v2, to identify potential protests in news articles and images using the DoCA Corpus and UCLA-protest imagery data, and provided the models via GitHub for social movement scholars.


Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models

Raviteja Vemulapalli,Hadi Pouransari,Fartash Faghri,Sachin Mehta,Mehrdad Farajtabar,Mohammad Rastegari,Oncel Tuzel

http://arxiv.org/abs/2311.18237v1

Compressor summary: The paper proposes a simple and effective method to use large pretrained models in resource-limited settings by transferring task-specific knowledge from them to small task-oriented models.


LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai,Isadora White,Charlie Snell,Charles Sun,Joey Hong,Yuexiang Zhai,Kelvin Xu,Sergey Levine

http://arxiv.org/abs/2311.18232v1

Compressor summary: The paragraph discusses the potential of reinforcement learning to create goal-directed language agents using large language models and introduces LMRL-Gym, an open-source benchmark for evaluating multi-turn RL for LLMs on various tasks.


TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model

Hantao Yao,Rui Zhang,Changsheng Xu

http://arxiv.org/abs/2311.18231v1

Compressor summary: TCP is a method for improving visual-language models' ability to generate task-specific textual classifiers by incorporating prior knowledge about classes and using Textual Knowledge Embedding.


FS-BAND: A Frequency-Sensitive Banding Detector

Zijian Chen,Wei Sun,Zicheng Zhang,Ru Huang,Fangfang Lu,Xiongkuo Min,Guangtao Zhai,Wenjun Zhang

http://arxiv.org/abs/2311.18216v1

Compressor summary: The paper introduces a new model called FS-BAND that can detect and evaluate banding artifacts, a common video compression issue, using frequency analysis and outperforms existing image quality assessment methods.


Automatic Construction of a Korean Toxic Instruction Dataset for Ethical Tuning of Large Language Models

Sungjoo Byun,Dongjun Jang,Hyemi Jo,Hyopil Shin

http://arxiv.org/abs/2311.18215v1

Compressor summary: KoTox is a collection of toxic instructions that helps train Large Language Models to produce less unethical language and respond better to toxic inputs in NLP applications.


SMaRt: Improving GANs with Score Matching Regularity

Mengfei Xia,Yujun Shen,Ceyuan Yang,Ran Yi,Wenping Wang,Yong-jin Liu

http://arxiv.org/abs/2311.18208v1

Compressor summary: This paper proposes using score matching to improve GANs' ability to generate data that matches the real data manifold, resulting in better synthesis performance on diverse and complex datasets.


Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation

Haruka Kiyohara,Ren Kishimoto,Kosuke Kawakami,Ken Kobayashi,Kazuhide Nakata,Yuta Saito

http://arxiv.org/abs/2311.18207v1

Compressor summary: SharpeRatio@k is a new metric that measures the risk-return tradeoff of policy portfolios formed by an offline evaluation method called Off-Policy Evaluation (OPE), helping to identify the most efficient estimator for online deployment.


SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation

Haruka Kiyohara,Ren Kishimoto,Kosuke Kawakami,Ken Kobayashi,Kazuhide Nakata,Yuta Saito

http://arxiv.org/abs/2311.18206v1

Compressor summary: SCOPE-RL is an open-source Python software that supports offline RL and OPE with flexible and reliable OPE modules, user-friendly APIs, and comprehensive documentation.


INarIG: Iterative Non-autoregressive Instruct Generation Model For Word-Level Auto Completion

Hengchao Shang,Zongyao Li,Daimeng Wei,Jiaxin Guo,Minghan Wang,Xiaoyu Chen,Lizhi Lei,Hao Yang

http://arxiv.org/abs/2311.18200v1

Compressor summary: The paper introduces INarIG, a model that predicts target words in Word-Level Auto Completion (WLAC) tasks by using human-typed sequences as Instruction Units and iterative decoding with subwords to improve translation efficiency for low-frequency words.


Hy-Tracker: A Novel Framework for Enhancing Efficiency and Accuracy of Object Tracking in Hyperspectral Videos

Mohammad Aminul Islam,Wangzhi Xing,Jun Zhou,Yongsheng Gao,Kuldip K. Paliwal

http://arxiv.org/abs/2311.18199v1

Compressor summary: The paper proposes Hy-Tracker, a framework that uses YOLOv7 for object detection and tracking in hyperspectral videos, addressing challenges like multiple spectral bands, scarce annotations, occlusions, and cluttered backgrounds.


S-T CRF: Spatial-Temporal Conditional Random Field for Human Trajectory Prediction

Pengqian Han,Jiamou Liu,Jialing He,Zeyu Zhang,Song Yang,Yanni Tang,Partha Roop

http://arxiv.org/abs/2311.18198v1

Compressor summary: The paper introduces a novel spatial-temporal conditional random field model for pedestrian trajectory prediction that incorporates intention information, improving performance over existing methods.


COVID-19 Vaccine Misinformation in Middle Income Countries

Jongin Kim,Byeo Rhee Back,Aditya Agrawal,Jiaxi Wu,Veronika J. Wirtz,Traci Hong,Derry Wijaya

http://arxiv.org/abs/2311.18195v1

Compressor summary: The paper presents a multilingual dataset of COVID-19 vaccine misinformation tweets from Brazil, Indonesia, and Nigeria, and proposes two methods to improve misinformation detection models.


Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes

Yongqiang Chen,Binghui Xie,Kaiwen Zhou,Bo Han,Yatao Bian,James Cheng

http://arxiv.org/abs/2311.18194v1

Compressor summary: The paper investigates the limitations of in-context learning (ICL) in large language models and proposes DeepSet, a permutation-invariant architecture that preserves input symmetry, to improve ICL performance.
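
The permutation-invariance property at the heart of a DeepSet-style architecture is easy to demonstrate. This is a generic sketch of the sum-pooling pattern (encode each element with a shared network φ, sum, then decode with ρ), not the paper's specific model; the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-element encoder phi and set-level decoder rho (random weights for the demo).
W_phi = rng.normal(size=(2, 16))
W_rho = rng.normal(size=(16, 1))

def phi(x):
    return np.tanh(x @ W_phi)        # encode each element independently

def deep_set(X):
    pooled = phi(X).sum(axis=0)      # sum pooling makes the output order-invariant
    return (pooled @ W_rho).item()   # decode the pooled representation

X = rng.normal(size=(5, 2))                        # a "set" of 5 two-dim elements
perm = rng.permutation(5)
assert np.isclose(deep_set(X), deep_set(X[perm]))  # same output for any ordering
```

Because the pooling step is a sum, reordering the in-context examples cannot change the output, which is exactly the symmetry that standard positional encodings break.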


An Effective Universal Polynomial Basis for Spectral Graph Neural Networks

Keke Huang,Pietro Liò

http://arxiv.org/abs/2311.18177v1

Compressor summary: The paper proposes UniBasis, a universal polynomial basis that adapts to different levels of graph heterophily, and UniFilter, a general polynomial filter that uses UniBasis for efficient graph analysis.
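The shared skeleton of polynomial spectral filters can be sketched as follows. This shows only the generic form y = Σ_k w_k Â^k x over the monomial basis; UniBasis itself replaces that basis with a heterophily-adaptive one, which is not reproduced here.

```python
import numpy as np

def polynomial_filter(adj, x, weights):
    """Generic K-order polynomial spectral filter: y = sum_k w_k * A_hat^k x,
    where A_hat is the symmetrically normalized adjacency with self-loops."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_hat = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
    out, h = weights[0] * x, x
    for w in weights[1:]:                          # accumulate w_k * A_hat^k x
        h = a_hat @ h
        out = out + w * h
    return out

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
x = np.array([[1.0], [0.0], [0.0]])
y = polynomial_filter(adj, x, weights=[0.5, 0.3, 0.2])  # smoothed node features
```

Learning the weights w_k lets the filter behave as a low-pass filter (useful on homophilic graphs) or a high-pass one (useful on heterophilic graphs); the choice of basis determines how well those weights can be fit.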


Few-shot Image Generation via Style Adaptation and Content Preservation

Xiaosheng He,Fan Yang,Fayao Liu,Guosheng Lin

http://arxiv.org/abs/2311.18169v1

Compressor summary: The paper proposes a new image translation module for GAN transfer that helps preserve content while adapting style when training with limited data, outperforming existing methods.


Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

Karren D. Yang,Anurag Ranjan,Jen-Hao Rick Chang,Raviteja Vemulapalli,Oncel Tuzel

http://arxiv.org/abs/2311.18168v1

Compressor summary: The paper proposes a probabilistic approach for animating 3D facial geometry from speech signals, addressing key challenges in data and metrics, and showing applications such as generating diverse speech-driven 3D facial motion and improving downstream audio-visual models.


A-Scan2BIM: Assistive Scan to Building Information Modeling

Weilian Song,Jieliang Luo,Dale Zhao,Yan Fu,Chin-Yi Cheng,Yasutaka Furukawa

http://arxiv.org/abs/2311.18166v1

Compressor summary: The paper introduces an assistive system for architects that converts large-scale point clouds into standardized digital building models by predicting editing operations executed through the Autodesk Revit API.


Compact3D: Compressing Gaussian Splat Radiance Field Models with Vector Quantization

KL Navaneet,Kossar Pourahmadi Meibodi,Soroush Abbasi Koohpayegani,Hamed Pirsiavash

http://arxiv.org/abs/2311.18159v1

Compressor summary: The paper introduces a vector quantization and compression method to reduce the storage cost of 3D Gaussian Splatting, a fast NeRF alternative, while maintaining image quality.
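
The compression idea can be illustrated with a toy example. This is a minimal sketch, not Compact3D's actual pipeline: I assume per-Gaussian parameter vectors (a stand-in for spherical-harmonic coefficients) are clustered with plain k-means, so that storage shrinks from one float vector per Gaussian to a shared codebook plus one small index per Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_quantize(params, k, iters=5):
    """Toy vector quantization: cluster per-Gaussian parameter vectors with
    k-means, then store only the codebook plus one index per Gaussian."""
    codebook = params[rng.choice(len(params), k, replace=False)]
    for _ in range(iters):
        d = ((params[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                    # nearest code for each vector
        for c in range(k):                        # recompute centroids
            members = params[idx == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook, idx

# 2,000 "Gaussians" with 48-dim appearance parameters.
params = rng.normal(size=(2_000, 48)).astype(np.float32)
codebook, idx = kmeans_quantize(params, k=64)

raw_bytes = params.nbytes                                 # 2000 * 48 * 4 bytes
vq_bytes = codebook.nbytes + idx.astype(np.uint8).nbytes  # codebook + 1-byte indices
print(raw_bytes / vq_bytes)                               # compression ratio, ~27x here
```

The ratio grows with the number of Gaussians, since the codebook cost is fixed while the per-Gaussian cost drops from a full float vector to a single index.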


HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

Yifan Zhang,Bryan Hooi

http://arxiv.org/abs/2311.18158v1

Compressor summary: HiPA is a method to improve the quality and speed of text-to-image diffusion by focusing on enhancing high-frequency information using low-rank adaptors.
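
The low-rank adaptor mechanism HiPA builds on can be sketched briefly. This shows only the generic LoRA-style parameterization (a frozen weight plus a trainable low-rank update), not HiPA's high-frequency-promoting objective; all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4      # full weight is 64x64; adaptor rank is 4

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (zero-init,
                                       # so the adaptor starts as a no-op)

def adapted_forward(x, alpha=8.0):
    # effective weight = W + scaled low-rank update B @ A
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(1, d_in))
assert np.allclose(adapted_forward(x), x @ W.T)  # zero-init B => output unchanged

full = W.size
lora = A.size + B.size
print(lora / full)   # fraction of parameters actually trained, 0.125 here
```

Training only A and B (a small fraction of the full weight count) is what makes adapting a large pretrained diffusion model to one-step generation cheap.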