arxiv compressed, 2023-11-29

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2023-11-29, generated by the compressor, my personal LLM-based project.


Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Daniel Geng,Inbum Park,Andrew Owens

http://arxiv.org/abs/2311.17919v1

Compressor summary: The authors propose a simple method to create multi-view optical illusions using text-to-image diffusion models and noise estimation from different views, resulting in visual anagrams that change appearance under certain transformations.


Do text-free diffusion models learn discriminative visual representations?

Soumik Mukhopadhyay,Matthew Gwilliam,Yosuke Yamaguchi,Vatsal Agarwal,Namitha Padmanabhan,Archana Swaminathan,Tianyi Zhou,Abhinav Shrivastava

http://arxiv.org/abs/2311.17921v1

Compressor summary: The paper proposes a unified representation learner that combines generative and discriminative tasks using diffusion models, which use U-Nets to remove noise and produce diverse, high-quality images, and introduces new mechanisms for feature fusion and feedback to improve performance on various image tasks.


A Simple Recipe for Language-guided Domain Generalized Segmentation

Mohammad Fahes,Tuan-Hung Vu,Andrei Bursuc,Patrick Pérez,Raoul de Charette

http://arxiv.org/abs/2311.17922v1

Compressor summary: The paper presents a framework for improving semantic segmentation with neural networks by using language to randomize and augment the data, while preserving CLIP's robustness.


Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

Yuqi Wang,Jiawei He,Lue Fan,Hongxin Li,Yuntao Chen,Zhaoxiang Zhang

http://arxiv.org/abs/2311.17918v1

Compressor summary: Drive-WM is a driving world model that generates high-fidelity multiview videos in driving scenes to enhance autonomous vehicle safety and efficiency by predicting future events and evaluating risks.


OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang,Xiaoyi Dong,Pan Zhang,Bin Wang,Conghui He,Jiaqi Wang,Dahua Lin,Weiming Zhang,Nenghai Yu

http://arxiv.org/abs/2311.17911v1

Compressor summary: OPERA is a new method to reduce hallucination in multi-modal language models by penalizing over-trust and retrospecting token selection during decoding.


HUGS: Human Gaussian Splats

Muhammed Kocabas,Jen-Hao Rick Chang,James Gabriel,Oncel Tuzel,Anurag Ranjan

http://arxiv.org/abs/2311.17910v1

Compressor summary: HUGS uses 3D Gaussian Splatting and a monocular video with a small number of frames to learn an animatable human avatar from a static scene, achieving state-of-the-art rendering quality and speed.


CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting

Alexander Vilesov,Pradyumna Chari,Achuta Kadambi

http://arxiv.org/abs/2311.17907v1

Compressor summary: CG3D is a method for generating detailed 3D graphics with text-conditioned guidance, overcoming constraints such as limited scene complexity and physically unrealistic compositions.


Language-conditioned Detection Transformer

Jang Hyun Cho,Philipp Krähenbühl

http://arxiv.org/abs/2311.17902v1

Compressor summary: The paper introduces DECOLA, an open-vocabulary detection framework that uses image-level labels and detailed annotations to train language-conditioned and unconditioned detectors for zero-shot performance on various benchmarks.


SODA: Bottleneck Diffusion Models for Representation Learning

Drew A. Hudson,Daniel Zoran,Mateusz Malinowski,Andrew K. Lampinen,Andrew Jaegle,James L. McClelland,Loic Matthey,Felix Hill,Alexander Lerchner

http://arxiv.org/abs/2311.17901v1

Compressor summary: SODA is a self-supervised diffusion model that learns strong visual representations by generating related novel views from compact source view encodings, enabling unsupervised ImageNet classification, reconstruction, editing, and synthesis tasks.


Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis

Jinqi Luo,Kwan Ho Ryan Chan,Dimitris Dimos,René Vidal

http://arxiv.org/abs/2311.17898v1

Compressor summary: KPP is a zero-shot framework that uses external knowledge to enhance text-driven generative models' quality and faithfulness in multiple tasks without accessing their parameters.


Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding,Rui Qian,Haohang Xu,Dahua Lin,Hongkai Xiong

http://arxiv.org/abs/2311.17893v1

Compressor summary: The paper presents a simple self-supervised video object segmentation method that uses DINO-pretrained Transformers and clustering to achieve state-of-the-art results without auxiliary modalities or slot attention.


A Pipeline For Discourse Circuits From CCG

Jonathon Liu,Razin A. Shaikh,Benjamin Rodatz,Richie Yeung,Bob Coecke

http://arxiv.org/abs/2311.17892v1

Compressor summary: DisCoCirc is a new model that connects linguistic theory and modern NLP by representing text as circuits that capture meaning and can be used with classical or quantum methods.


Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation

Or Hirschorn,Shai Avidan

http://arxiv.org/abs/2311.17891v1

Compressor summary: The paper introduces a new category-agnostic pose estimation method that uses a Graph Transformer Decoder to capture geometrical relations between keypoints, improving accuracy on the MP-100 benchmark.


TSDF-Sampling: Efficient Sampling for Neural Surface Field using Truncated Signed Distance Field

Chaerin Min,Sehyun Cha,Changhee Won,Jongwoo Lim

http://arxiv.org/abs/2311.17878v1

Compressor summary: The paper proposes a new method to speed up multi-view neural surface reconstruction by using the Truncated Signed Distance Field (TSDF) of the scene, which reduces the number of samplings and maintains high rendering quality.


Enhancing Post-Hoc Explanation Benchmark Reliability for Image Classification

Tristan Gomez,Harold Mouchère

http://arxiv.org/abs/2311.17876v1

Compressor summary: This paper uses Krippendorff's alpha to measure the reliability of image classification explanation methods and suggests model modifications to improve it.
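As an aside, Krippendorff's alpha is a general inter-rater reliability statistic; a minimal nominal-data implementation (my own sketch, not the paper's code) looks like this:

```python
from collections import Counter
from itertools import combinations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of rating lists, one per item; each inner list holds
    the labels assigned by the raters who rated that item.
    """
    # Only items with at least two ratings contribute pairable values.
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)

    # Observed disagreement: within-unit differing pairs, each unit
    # weighted by 1 / (m_u - 1) as in the coincidence-matrix formulation.
    d_o = sum(
        sum(a != b for a, b in combinations(u, 2)) * 2 / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement from the pooled label frequencies.
    counts = Counter(v for u in units for v in u)
    d_e = sum(
        counts[c] * counts[k]
        for c in counts for k in counts if c != k
    ) / (n * (n - 1))

    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

Perfect agreement yields alpha = 1; systematic disagreement drives it toward (and below) 0.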


FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information

Wen Jiang,Boshu Lei,Kostas Daniilidis

http://arxiv.org/abs/2311.17874v1

Compressor summary: The study proposes a new method using Fisher Information to efficiently select informative views and quantify uncertainty in Neural Radiance Fields, achieving state-of-the-art results and fast performance.


SAIBench: A Structural Interpretation of AI for Science Through Benchmarks

Yatao Li,Jianfeng Zhan

http://arxiv.org/abs/2311.17869v1

Compressor summary: AI4S research uses machine learning to improve scientific computing, but needs better benchmarking methods like structural interpretation to ensure accuracy in real-world applications.


Gaussian Shell Maps for Efficient 3D Human Generation

Rameen Abdal,Wang Yifan,Zifan Shi,Yinghao Xu,Ryan Po,Zhengfei Kuang,Qifeng Chen,Dit-Yan Yeung,Gordon Wetzstein

http://arxiv.org/abs/2311.17857v1

Compressor summary: Gaussian Shell Maps (GSMs) are a new framework for generating high-quality, multi-view consistent 3D digital humans using an articulable scaffold of inflated and deflated shells with 3D Gaussian rendering primitives, avoiding the need for volume representations or view-inconsistent upsamplers.


Leveraging Graph Diffusion Models for Network Refinement Tasks

Puja Trivedi,Ryan Rossi,David Arbour,Tong Yu,Franck Dernoncourt,Sungchul Kim,Nedim Lipka,Namyong Park,Nesreen K. Ahmed,Danai Koutra

http://arxiv.org/abs/2311.17856v1

Compressor summary: The paragraph introduces a new graph generative framework called SGDM that uses subgraph diffusion to refine noisy and incomplete networks in various ways, such as removing unwanted subgraphs, expanding existing ones, and changing their style.


Maximum Entropy Model Correction in Reinforcement Learning

Amin Rakhsha,Mete Kemertas,Mohammad Ghavamzadeh,Amir-massoud Farahmand

http://arxiv.org/abs/2311.17855v1

Compressor summary: The paper proposes a method for planning in reinforcement learning using an approximate model that can reduce error, accelerate convergence, and outperform traditional approaches.


Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

Rishabh Kabra,Loic Matthey,Alexander Lerchner,Niloy J. Mitra

http://arxiv.org/abs/2311.17851v1

Compressor summary: The paper proposes a method to leverage pretrained vision language models for various annotation tasks involving unlabeled 3D objects by aggregating their scores and improving downstream predictions.


Towards Real-World Focus Stacking with Deep Learning

Alexandre Araujo,Jean Ponce,Julien Mairal

http://arxiv.org/abs/2311.17846v1

Compressor summary: The paper introduces a new dataset and deep learning algorithm for focus stacking in photography that works well with long bursts of real-world images and is more robust to noise.


SPiC-E : Structural Priors in 3D Diffusion Models using Cross Entity Attention

Etai Sella,Gal Fiebelman,Noam Atia,Hadar Averbuch-Elor

http://arxiv.org/abs/2311.17834v1

Compressor summary: SPiC-E is a neural network that improves 3D diffusion models by using cross-entity attention to learn structural guidance from auxiliary shapes, enabling various applications with high quality and speed.


Analyzing and Explaining Image Classifiers via Diffusion Guidance

Maximilian Augustin,Yannic Neuhaus,Matthias Hein

http://arxiv.org/abs/2311.17833v1

Compressor summary: The paper proposes a framework for generating images that help analyze and improve image classifiers' reliability and explainability, revealing new and existing failure modes.


Anomalous Behavior Detection in Trajectory Data of Older Drivers

Seyedeh Gol Ara Ghoreishi,Sonia Moshfeghi,Muhammad Tanveer Jan,Joshua Conniff,KwangSoo Yang,Jinwoo Jang,Borko Furht,Ruth Tappen,David Newman,Monica Rosselli,Jiannan Zhai

http://arxiv.org/abs/2311.17822v1

Compressor summary: The paper proposes a method to detect drivers with unusual behavior from large datasets of detailed trajectories, which can help with applications like MCI detection and safe route recommendations for older drivers.


Higher-Order DisCoCat (Peirce-Lambek-Montague semantics)

Alexis Toumi,Giovanni de Felice

http://arxiv.org/abs/2311.17813v1

Compressor summary: The authors introduce a new type of linguistic model based on diagram-valued functions that can handle non-linear language phenomena and have a Python implementation.


DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation

Ting Liu,Yue Hu,Wansen Wu,Youkai Wang,Kai Xu,Quanjun Yin

http://arxiv.org/abs/2311.17812v1

Compressor summary: The paragraph describes a new method (DAP) for improving vision-and-language models' performance in navigation tasks by learning soft visual prompts from in-domain image-text pairs.


Coloring the Past: Neural Historical Buildings Reconstruction from Archival Photography

David Komorowicz,Lu Sang,Ferdinand Maiwald,Daniel Cremers

http://arxiv.org/abs/2311.17810v1

Compressor summary: The authors propose a volumetric rendering technique to reconstruct 3D models of historical buildings from limited datasets, including color appearance loss and a new historical dataset.


Aggregation Model Hyperparameters Matter in Digital Pathology

Gustav Bredell,Marcel Fischer,Przemyslaw Szostak,Samaneh Abbasi-Sureshjani,Alvaro Gomariz

http://arxiv.org/abs/2311.17804v1

Compressor summary: The paragraph discusses how the performance of feature extractor models in digital pathology depends on the choice of aggregation model hyperparameters, and proposes a comprehensive evaluation approach to understand this relationship better.


Learning to Simulate: Generative Metamodeling via Quantile Regression

L. Jeff Hong,Yanxi Hou,Qingkai Zhang,Xiaowei Zhang

http://arxiv.org/abs/2311.17797v1

Compressor summary: Generative metamodeling is a new technique that quickly generates outputs from complex simulation models for real-time decision-making, while preserving the distribution of inputs, and the paper proposes a new algorithm called quantile-regression-based generative metamodeling (QRGMM).
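The core quantile-regression trick can be illustrated in its simplest, unconditional form: estimate the simulator's quantile function once offline, then generate new outputs by inverse-transform sampling with no further simulator calls. A rough numpy sketch of mine, not the paper's QRGMM algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are expensive simulation outputs collected offline.
sim_outputs = rng.normal(loc=3.0, scale=2.0, size=10_000)

# "Train": estimate the quantile function on a grid of probability levels.
probs = np.linspace(0.01, 0.99, 99)
quantiles = np.quantile(sim_outputs, probs)

def generate(n):
    # "Generate": inverse-transform sampling by interpolating the
    # estimated quantile function -- fast enough for real-time use.
    u = rng.uniform(0.01, 0.99, size=n)
    return np.interp(u, probs, quantiles)

samples = generate(10_000)
```

The paper's contribution is the conditional version (quantiles regressed on the simulation inputs), which this unconditional toy omits.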


Marginal Laplacian Score

Guy Hay,Ohad Volk

http://arxiv.org/abs/2311.17795v1

Compressor summary: The paper proposes Marginal Laplacian Score, a modified unsupervised feature selection method for handling imbalanced data, which improves the performance of Differentiable Unsupervised Feature Selection on synthetic and real-world data sets.
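For context, the classic (non-marginal) Laplacian Score that the paper modifies can be sketched in a few lines; this is the textbook version, not the authors' MLS:

```python
import numpy as np

def laplacian_scores(X, k=5):
    """Classic unsupervised Laplacian Score per feature (lower = better):
    features that vary smoothly over a kNN graph of the samples score low."""
    n = X.shape[0]
    # Pairwise squared distances and a symmetric 0/1 kNN adjacency.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:   # skip self at position 0
            W[i, j] = W[j, i] = 1.0
    deg = W.sum(1)
    L = np.diag(deg) - W
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r].astype(float)
        f = f - (f @ deg) / deg.sum()   # center against the degree measure
        scores.append((f @ L @ f) / (f @ (deg * f)))
    return np.array(scores)

rng = np.random.default_rng(1)
cluster = np.repeat([0.0, 10.0], 50)
X = np.column_stack([
    cluster + rng.normal(0, 0.5, 100),  # tracks the cluster structure
    rng.normal(0, 1.0, 100),            # pure noise
])
s = laplacian_scores(X)
```

Here the structure-preserving feature gets a much lower score than the noise feature; MLS adapts this scoring to focus on the marginal (minority) region of imbalanced data.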


Propagate & Distill: Towards Effective Graph Learners Using Propagation-Embracing MLPs

Yong-Min Shin,Won-Yong Shin

http://arxiv.org/abs/2311.17781v1

Compressor summary: The authors propose a method called Propagate & Distill (P&D) to train a student MLP using knowledge distillation from a teacher GNN, which injects structural information by propagating the output of the teacher before distillation.


One-Shot Open Affordance Learning with Foundation Models

Gen Li,Deqing Sun,Laura Sevilla-Lara,Varun Jampani

http://arxiv.org/abs/2311.17776v1

Compressor summary: The paper introduces a vision-language framework for learning object affordances from one example per category, which improves upon existing models' understanding and performance in this task.


Supervising the Centroid Baseline for Extractive Multi-Document Summarization

Simão Gonçalves,Gonçalo Correia,Diogo Pernes,Afonso Mendes

http://arxiv.org/abs/2311.17771v1

Compressor summary: The paragraph describes an enhanced version of the centroid method for extractive multi-document summarization that uses beam search and attention to achieve better performance across multiple languages.
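The centroid baseline being supervised here is simple enough to sketch; the following is the textbook unsupervised version (bag-of-words, no beam search or attention, unlike the paper's variant):

```python
import math
from collections import Counter

def centroid_summary(sentences, k=2):
    """Plain centroid baseline: represent each sentence as a bag-of-words
    vector, score it by cosine similarity to the mean (centroid) vector,
    and return the top-k sentences in document order."""
    docs = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    vecs = [[d[w] for w in vocab] for d in docs]
    centroid = [sum(col) / len(vecs) for col in zip(*vecs)]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-12)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: cos(vecs[i], centroid), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Sentences far from the document's overall vocabulary are ranked last and dropped from the summary.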


PillarNeSt: Embracing Backbone Scaling and Pretraining for Pillar-based 3D Object Detection

Weixin Mao,Tiancai Wang,Diankun Zhang,Junjie Yan,Osamu Yoshie

http://arxiv.org/abs/2311.17770v1

Compressor summary: The paper improves pillar-based 3D object detection by using pretrained 2D ConvNets as backbones, which adapt to point cloud features like sparsity and irregularity.


Robustness Approaches for the Examination Timetabling Problem under Data Uncertainty

Bernd Bassimir,Rolf Wanka

http://arxiv.org/abs/2311.17766v1

Compressor summary: The authors discuss different robust optimization methods for solving the examination timetabling problem with uncertainty, and evaluate their performance on real and random instances.


Cinematic Behavior Transfer via NeRF-based Differentiable Filming

Xuekun Jiang,Anyi Rao,Jingbo Wang,Dahua Lin,Bo Dai

http://arxiv.org/abs/2311.17754v1

Compressor summary: The authors propose a method for optimizing camera movements and transferring shot types to new videos or virtual environments using NeRF and SMPL techniques.


BAND-2k: Banding Artifact Noticeable Database for Banding Detection and Quality Assessment

Zijian Chen,Wei Sun,Jun Jia,Fangfang Lu,Zicheng Zhang,Jing Liu,Ru Huang,Xiongkuo Min,Guangtao Zhai

http://arxiv.org/abs/2311.17752v1

Compressor summary: The paper presents a large dataset for image banding assessment, proposes an effective method for detecting and evaluating banding artifacts using convolutional neural networks, and shows high correlation between banding intensity and perceptual quality.


Variational Bayes image restoration with compressive autoencoders

Maud Biquard,Marie Chabert,Thomas Oberlin

http://arxiv.org/abs/2311.17744v1

Compressor summary: The paper proposes a new algorithm (VBLE) for regularization in computational imaging using compressive autoencoders, which are smaller and easier to train than generative models, and shows that it performs well and runs fast compared to existing methods.


Mukhyansh: A Headline Generation Dataset for Indic Languages

Lokesh Madasu,Gopichand Kanumolu,Nirmal Surange,Manish Shrivastava

http://arxiv.org/abs/2311.17743v1

Compressor summary: Mukhyansh is a large multilingual dataset for headline generation in Indian languages, overcoming challenges due to low-resource and limited data quality.


End-to-end Joint Rich and Normalized ASR with a limited amount of rich training data

Can Cui,Imran Ahamad Sheikh,Mostafa Sadeghi,Emmanuel Vincent

http://arxiv.org/abs/2311.17741v1

Compressor summary: The paper compares two methods to train a speech recognition model that produces transcriptions with punctuation and capitalization, using limited labeled data and achieving different performance on out-of-domain data.


GenZI: Zero-Shot 3D Human-Scene Interaction Generation

Lei Li,Angela Dai

http://arxiv.org/abs/2311.17737v1

Compressor summary: GenZI is a method for generating 3D human-scene interactions using natural language descriptions and no 3D data, by distilling interaction priors from vision-language models and optimizing a 3D model's pose and shape.


Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Chi-Pin Huang,Kai-Po Chang,Chung-Ting Tsai,Yung-Hsuan Lai,Yu-Chiang Frank Wang

http://arxiv.org/abs/2311.17717v1

Compressor summary: Receler is a method to remove specific concepts from text-to-image models by using locality and robustness, improving performance over previous erasing methods.


SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation

Mutian Xu,Xingyilang Yin,Lingteng Qiu,Yang Liu,Xin Tong,Xiaoguang Han

http://arxiv.org/abs/2311.17707v1

Compressor summary: SAMPro3D is a method for segmenting 3D indoor scenes from 2D frames using pretrained SAM, with techniques to improve alignment, quality, and diversity of results without additional training.


How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation

Chenxi Dong

http://arxiv.org/abs/2311.17696v1

Compressor summary: AI Tutor is a web application that uses a large language model to provide personalized, evidence-based tutoring in any subject based on course materials.


Fair Text-to-Image Diffusion via Fair Mapping

Jia Li,Lijie Hu,Jingfeng Zhang,Tianhang Zheng,Hua Zhang,Di Wang

http://arxiv.org/abs/2311.17695v1

Compressor summary: Fair Mapping is a model-agnostic method for generating fair and diverse images from text-to-image diffusion models by controlling prompts and using a linear mapping network.


AviationGPT: A Large Language Model for the Aviation Domain

Liya Wang,Jason Chou,Xin Zhou,Alex Tien,Diane M Baumgartner

http://arxiv.org/abs/2311.17686v1

Compressor summary: AviationGPT is a large language model designed for the aviation domain that can handle various NLP tasks and improve the efficiency and safety of NAS operations.


Improving Minority Stress Detection with Emotions

Jonathan Ivey,Susan Gauch

http://arxiv.org/abs/2311.17676v1

Compressor summary: The authors evaluate psychological stress models on detecting minority stress and suggest using emotion-infused models to improve performance for these vulnerable populations.


TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

Zheng Chu,Jingchang Chen,Qianglong Chen,Weijiang Yu,Haotian Wang,Ming Liu,Bing Qin

http://arxiv.org/abs/2311.17667v1

Compressor summary: The paper introduces TimeBench, a benchmark for testing the temporal reasoning abilities of large language models, which reveals a performance gap between current LLMs and humans.


Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Junyi Ma,Xieyuanli Chen,Jiawei Huang,Jingyi Xu,Zhen Luo,Jintao Xu,Weihao Gu,Rui Ai,Hesheng Wang

http://arxiv.org/abs/2311.17663v1

Compressor summary: The paragraph introduces a new benchmark, Cam4DOcc, for camera-only 4D occupancy forecasting in autonomous driving applications that considers future scene changes based on multiple public datasets and evaluates four baseline methods.


Volumetric Cloud Field Reconstruction

Jacob Lin,Miguel Farinha,Edward Gryspeerdt,Ronald Clark

http://arxiv.org/abs/2311.17657v1

Compressor summary: The paper presents a novel deep learning approach that uses stereo images to reconstruct the shape and dynamics of volumetric phenomena like clouds and fog.


Multiple Toddler Tracking in Indoor Videos

Somaieh Amraee,Bishoy Galoaa,Matthew Goodwin,Elaheh Hatamimajoumerd,Sarah Ostadabbas

http://arxiv.org/abs/2311.17656v1

Compressor summary: The paper presents MTTSort, a method for accurately tracking toddlers in indoor videos using the DeepSORT algorithm and addressing challenges such as unpredictable movements, occlusions, and limited fields of view.


Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

Pavel Korshunov,Haolin Chen,Philip N. Garner,Sebastien Marcel

http://arxiv.org/abs/2311.17655v1

Compressor summary: The paper introduces SWAN-DF, a realistic audio-visual deepfakes database that tests the vulnerability of face and speech recognition systems to synthetic media.


VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following

Yujie Lu,Xiujun Li,William Yang Wang,Yejin Choi

http://arxiv.org/abs/2311.17647v1

Compressor summary: VIM is a framework that tests how well multimodal language models understand visual instructions by embedding them in scenes, revealing performance differences among models.


Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

Alexander Becker,Rodrigo Caye Daudt,Nando Metzger,Jan Dirk Wegner,Konrad Schindler

http://arxiv.org/abs/2311.17643v1

Compressor summary: The authors propose a novel way to design neural fields for single image super-resolution that incorporates Gaussian PSF as an anti-aliasing technique without increasing computational cost.


Erasing the Ephemeral: Joint Camera Refinement and Transient Object Removal for Street View Synthesis

Mreenav Shyam Deka,Lu Sang,Daniel Cremers

http://arxiv.org/abs/2311.17634v1

Compressor summary: The paper presents a method for creating new views of outdoor urban scenes using neural point light fields and dynamic object detection, while optimizing camera pose and refining both elements.


Introduction to Transformers: an NLP Perspective

Tong Xiao,Jingbo Zhu

http://arxiv.org/abs/2311.17633v1

Compressor summary: The paper provides an overview of Transformers, their architecture, refinements, applications, and limitations in natural language processing.


Efficient Decoder for End-to-End Oriented Object Detection in Remote Sensing Images

Jiaqi Zhao,Zeyu Ding,Yong Zhou,Hancheng Zhu,Wenliang Du,Rui Yao,Abdulmotaleb El Saddik

http://arxiv.org/abs/2311.17629v1

Compressor summary: The proposed end-to-end oriented detector uses RRoI attention for multi-scale feature alignment and SDQ for efficient query optimization, achieving state-of-the-art performance on multiple datasets.


Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation

Yuan Wang,Naisong Luo,Tianzhu Zhang

http://arxiv.org/abs/2311.17626v1

Compressor summary: The paper presents a new few-shot segmentation model, AMFormer, that focuses on query information and achieves accurate results with minimal support guidance or labels.


ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model

Fukun Yin,Xin Chen,Chi Zhang,Biao Jiang,Zibo Zhao,Jiayuan Fan,Gang Yu,Taihao Li,Tao Chen

http://arxiv.org/abs/2311.17618v1

Compressor summary: ShapeGPT is a multimodal framework that uses language models to generate and edit 3D shapes based on instructions in natural language.


AnyLens: A Generative Diffusion Model with Any Rendering Lens

Andrey Voynov,Amir Hertz,Moab Arar,Shlomi Fruchter,Daniel Cohen-Or

http://arxiv.org/abs/2311.17609v1

Compressor summary: The study presents a framework that combines a text-to-image diffusion model with lens geometry to create realistic images with diverse visual effects like fish-eye and panorama.


Adversarial Robust Memory-Based Continual Learner

Xiaoyue Mi,Fan Tang,Zonghan Yang,Danding Wang,Juan Cao,Peng Li,Yang Liu

http://arxiv.org/abs/2311.17608v1

Compressor summary: The study proposes a new memory-based continual learning method that improves robustness against adversarial attacks by adjusting data logits and using gradient-based data selection.


Topology-Preserving Adversarial Training

Xiaoyue Mi,Fan Tang,Yepeng Weng,Danding Wang,Juan Cao,Sheng Tang,Peng Li,Yang Liu

http://arxiv.org/abs/2311.17607v1

Compressor summary: The study proposes TRAIN, a method that preserves the structure of natural samples in the representation space during adversarial training, improving both natural and robust accuracies on various image datasets.


Continual Learning with Low Rank Adaptation

Martin Wistuba,Prabhu Teja Sivaprasad,Lukas Balles,Giovanni Zappella

http://arxiv.org/abs/2311.17601v1

Compressor summary: The paper proposes CoLoR, a continual learning method that uses Low Rank Adaptation (LoRA) to update pre-trained transformers and maintain their performance on new data without relying on prompt tuning.
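The LoRA mechanism CoLoR builds on is worth a quick sketch: the pre-trained weight stays frozen and each task only trains a low-rank additive update. A minimal numpy illustration (mine, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = 0.01 * rng.normal(size=(rank, d_in))  # trainable low-rank factor
B = np.zeros((d_out, rank))               # trainable, zero-initialised

def forward(x):
    # LoRA-adapted layer: W x + B (A x); only A and B are updated per task,
    # so each task costs rank * (d_in + d_out) parameters instead of
    # d_in * d_out.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
full_params = W.size            # 512 * 512 = 262144
lora_params = A.size + B.size   # 8 * (512 + 512) = 8192
```

Zero-initialising B means the adapted model starts exactly at the pre-trained one, which is what makes storing one cheap adapter per task a natural fit for continual learning.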


Improving embedding of graphs with missing data by soft manifolds

Andrea Marinoni,Pietro Lio',Alessandro Barp,Christian Jutten,Mark Girolami

http://arxiv.org/abs/2311.17598v1

Compressor summary: The paper introduces soft manifolds, a new class of mathematical structures for graph embedding that can handle weighted connections and missing data in complex datasets, leading to more accurate and reliable graph analysis.


Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning

Yiwen Ye,Yutong Xie,Jianpeng Zhang,Ziyang Chen,Qi Wu,Yong Xia

http://arxiv.org/abs/2311.17597v1

Compressor summary: The paper proposes MedCoSS, a continuous self-supervised learning approach for multi-modal medical data that addresses representation conflicts and catastrophic forgetting using rehearsal-based continual learning and feature distillation.


LanGWM: Language Grounded World Model

Rudra P. K. Poudel,Harit Pandya,Chao Zhang,Roberto Cipolla

http://arxiv.org/abs/2311.17593v1

Compressor summary: The authors propose a method called LanGWM that uses language to improve reinforcement learning models' ability to handle out-of-distribution tasks and demonstrate its effectiveness in iGibson point navigation tasks.


SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

Ziqiao Peng,Wentao Hu,Yue Shi,Xiangyu Zhu,Xiaomei Zhang,Hao Zhao,Jun He,Hongyan Liu,Zhaoxin Fan

http://arxiv.org/abs/2311.17590v1

Compressor summary: SyncTalk is a NeRF-based method that improves the realism of talking head videos by synchronizing facial expressions, lip movements, and head poses using innovative techniques.


CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning

Xu Liu,Shu Zhou,Yurong Song,Wenzhe Luo,Xin Zhang

http://arxiv.org/abs/2311.17583v1

Compressor summary: The proposed face liveness detection method uses image-text pairs and contrastive learning to detect eight types of financial field attack behaviors, achieving high performance and robustness on various datasets.


LoCoMotif: Discovering time-warped motifs in time series

Daan Van Wesenbeeck,Aras Yurtman,Wannes Meert,Hendrik Blockeel

http://arxiv.org/abs/2311.17582v1

Compressor summary: LoCoMotif is a novel method for time series motif discovery that overcomes existing limitations and performs better in a physiotherapy use case.
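For contrast with LoCoMotif, the classic fixed-length motif-discovery baseline is just a brute-force search for the closest pair of non-overlapping subsequences; a sketch of that baseline (LoCoMotif additionally handles time warping and variable-length motifs):

```python
import numpy as np

def naive_motif(ts, m):
    """Return the start indices of the two non-overlapping length-m
    subsequences of `ts` with minimal Euclidean distance."""
    n = len(ts) - m + 1
    best, pair = np.inf, (0, m)
    for i in range(n):
        for j in range(i + m, n):        # j >= i + m: no overlap
            d = np.linalg.norm(ts[i:i + m] - ts[j:j + m])
            if d < best:
                best, pair = d, (i, j)
    return pair

rng = np.random.default_rng(0)
ts = rng.normal(0.0, 1.0, 300)
motif = 5 * np.sin(np.linspace(0, 2 * np.pi, 30))
# Plant the same low-noise pattern at two known positions.
ts[40:70] = motif + rng.normal(0.0, 0.1, 30)
ts[200:230] = motif + rng.normal(0.0, 0.1, 30)
i, j = naive_motif(ts, 30)
```

On this toy series the search recovers the two planted occurrences; the quadratic scan over all window pairs is exactly the cost that faster motif-discovery methods work to avoid.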


LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching

Wenhao Zhong,Jie Jiang

http://arxiv.org/abs/2311.17571v1

Compressor summary: The paper proposes a novel convolutional transformer that captures both local and global features for image matching under extreme conditions, outperforming existing methods.


Bias Resilient Multi-Step Off-Policy Goal-Conditioned Reinforcement Learning

Lisheng Wu,Ke Chen

http://arxiv.org/abs/2311.17565v1

Compressor summary: The paper analyzes two types of off-policy biases in goal-conditioned reinforcement learning, proposes solutions to leverage their benefits, and shows improved efficiency and performance in challenging ten-step scenarios.


Interpreting Differentiable Latent States for Healthcare Time-series Data

Yu Chen,Nivedita Bijlani,Samaneh Kouchaki,Payam Barnaghi

http://arxiv.org/abs/2311.17560v1

Compressor summary: The paper presents an algorithm to interpret latent states and predictions in machine learning models, which can help identify patterns and predict patient outcomes in digital healthcare.


VINNA for Neonates -- Orientation Independence through Latent Augmentations

Leonie Henschel,David Kügler,Lilla Zöllei,Martin Reuter

http://arxiv.org/abs/2311.17546v1

Compressor summary: The paper presents VINNA, a method for segmenting neonatal brain images that uses resolution-aware internal augmentations and 4-DOF transform module to improve accuracy and robustness.


TaskWeaver: A Code-First Agent Framework

Bo Qiao,Liqun Li,Xu Zhang,Shilin He,Yu Kang,Chaoyun Zhang,Fangkai Yang,Hang Dong,Jue Zhang,Lu Wang,Minghua Ma,Pu Zhao,Si Qin,Xiaoting Qin,Chao Du,Yong Xu,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang

http://arxiv.org/abs/2311.17541v1

Compressor summary: TaskWeaver is a framework that uses LLMs to create chatbots with rich data structures, flexible plugins, and secure code execution for complex tasks in specific domains.


The Effects of Overparameterization on Sharpness-aware Minimization: An Empirical and Theoretical Analysis

Sungbin Shin,Dongyeop Lee,Maksym Andriushchenko,Namhoon Lee

http://arxiv.org/abs/2311.17539v1

Compressor summary: The paper investigates how overparameterization affects sharpness-aware minimization (SAM) and finds that it improves generalization, convergence rate, and stability of minima in neural networks.
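The SAM update the paper analyzes is compact enough to show on a toy loss; this is a generic sketch of the two-step rule (perturb toward the worst case, then descend from there), not the paper's experimental setup:

```python
import numpy as np

def grad(w):
    # Analytic gradient of the toy loss f(w) = 0.5 * ||w||^2.
    return w

def sam_step(w, rho=0.05, lr=0.1):
    """One SAM update: move to the locally worst-case point within an
    L2 ball of radius rho, then descend using the gradient evaluated there."""
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial perturbation
    return w - lr * grad(w + eps)

w = np.array([3.0, -4.0])
for _ in range(100):
    w = sam_step(w)
```

Because the descent gradient is taken at the perturbed point, SAM steers toward minima whose whole rho-neighborhood has low loss, i.e. flat minima.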


Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning

Liang Peng,Haoran Cheng,Zheng Yang,Ruisi Zhao,Linxuan Xia,Chaotian Song,Qinglin Lu,Wei Liu,Boxi Wu

http://arxiv.org/abs/2311.17536v1

Compressor summary: This paper proposes a noise constraint for one-shot video tuning methods to improve consistency and smoothness, and introduces a new metric to evaluate video smoothness better.


Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Xingqun Qi,Jiahao Pan,Peng Li,Ruibin Yuan,Xiaowei Chi,Mengfei Li,Wenhan Luo,Wei Xue,Shanghang Zhang,Qifeng Liu,Yike Guo

http://arxiv.org/abs/2311.17532v1

Compressor summary: The paper proposes a novel method for generating realistic 3D co-speech gestures with emotional transitions using ChatGPT-4, audio inpainting, weakly supervised training, and keyframe sampling.


HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models

Shen Zhang,Zhaowei Chen,Zhenyu Zhao,Zhenyuan Chen,Yao Tang,Yuhao Chen,Wengang Cao,Jiajun Liang

http://arxiv.org/abs/2311.17528v1

Compressor summary: HiDiffusion is a framework that improves high-resolution image synthesis by adjusting feature map size and using dynamic window attention in U-Net, achieving state-of-the-art performance without tuning.