arxiv compressed, 2024-01-05

This page contains one-sentence summaries, generated by the compressor (my personal LLM-based project), of cs.AI/ML/CV/CL papers announced on 2024-01-05.


Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak,Muhammad Ferjad Naeem,Muzammal Naseer,Luc Van Gool,Federico Tombari

http://arxiv.org/abs/2401.02418v1

Compressor summary: The authors propose a method to learn prompts for vision-language tasks using only text data derived from large language models, enabling zero-shot transfer and reducing costs.


ODIN: A Single Model for 2D and 3D Perception

Ayush Jain,Pushkal Katara,Nikolaos Gkanatsios,Adam W. Harley,Gabriel Sarch,Kriti Aggarwal,Vishrav Chaudhary,Katerina Fragkiadaki

http://arxiv.org/abs/2401.02416v1

Compressor summary: ODIN is a transformer model that simultaneously segments and labels 2D images and 3D point clouds, achieving state-of-the-art performance on various 3D perception benchmarks.


LLaMA Pro: Progressive LLaMA with Block Expansion

Chengyue Wu,Yukang Gan,Yixiao Ge,Zeyu Lu,Jiahao Wang,Ye Feng,Ping Luo,Ying Shan

http://arxiv.org/abs/2401.02415v1

Compressor summary: The paper proposes a new method to improve LLMs by expanding Transformer blocks and tuning them with a new corpus, achieving better performance on general tasks, programming, and math.
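The block-expansion idea can be illustrated in miniature: copy an existing block and zero out its residual branch, so the deeper network starts out computing exactly the same function and the new blocks can then be tuned on the new corpus. The sketch below is a toy stand-in (the `ResidualBlock` class and its `scale` field are illustrative assumptions, not the paper's implementation):

```python
import copy

class ResidualBlock:
    # Toy stand-in for a Transformer block in residual form: x + f(x).
    def __init__(self, scale):
        self.scale = scale            # stands in for the block's weights

    def __call__(self, x):
        return x + self.scale * x

def run(blocks, x):
    # Apply the blocks in sequence.
    for blk in blocks:
        x = blk(x)
    return x

def expand_blocks(blocks, every=2):
    # After every `every` blocks, insert a copy whose residual branch is
    # zeroed, so the expanded stack initially computes the same function
    # as the original (identity-preserving expansion).
    out = []
    for i, blk in enumerate(blocks):
        out.append(blk)
        if (i + 1) % every == 0:
            new = copy.deepcopy(blk)
            new.scale = 0.0           # zeroed branch => identity mapping
            out.append(new)
    return out
```

Because the inserted blocks are initialized to the identity, the expanded model's outputs match the original's before any further tuning.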


Bring Metric Functions into Diffusion Models

Jie An,Zhengyuan Yang,Jianfeng Wang,Linjie Li,Zicheng Liu,Lijuan Wang,Jiebo Luo

http://arxiv.org/abs/2401.02414v1

Compressor summary: Cas-DM improves DDPM with two modules that predict the noise and the clean image respectively, allowing metric functions such as the LPIPS loss to improve image quality.


LLM Augmented LLMs: Expanding Capabilities through Composition

Rachit Bansal,Bidisha Samanta,Siddharth Dalmia,Nitish Gupta,Shikhar Vashishth,Sriram Ganapathy,Abhishek Bapna,Prateek Jain,Partha Talukdar

http://arxiv.org/abs/2401.02412v1

Compressor summary: CALM composes a foundation model with specialized models via cross-attention, enabling new capabilities while preserving existing ones and improving efficiency.


What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs

Alex Trevithick,Matthew Chan,Towaki Takikawa,Umar Iqbal,Shalini De Mello,Manmohan Chandraker,Ravi Ramamoorthi,Koki Nagano

http://arxiv.org/abs/2401.02411v1

Compressor summary: Our method improves the resolution and consistency of 3D geometry generated by 3D GANs using high-resolution neural volume rendering without post-processing superresolution.


Real-Time 2D Temperature Field Prediction in Metal Additive Manufacturing Using Physics-Informed Neural Networks

Pouyan Sajadi,Mostafa Rahmani Dehaghani,Yifan Tang,G. Gary Wang

http://arxiv.org/abs/2401.02403v1

Compressor summary: The paper presents a physics-informed neural network that predicts temperature fields in metal additive manufacturing using real-time data and can handle different scenarios.


3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Zihao Xiao,Longlong Jing,Shangxuan Wu,Alex Zihao Zhu,Jingwei Ji,Chiyu Max Jiang,Wei-Chih Hung,Thomas Funkhouser,Weicheng Kuo,Anelia Angelova,Yin Zhou,Shiwei Sheng

http://arxiv.org/abs/2401.02402v1

Compressor summary: Our paper proposes a novel 3D open-vocabulary panoptic segmentation method that fuses LiDAR and vision features, using a single classification head and new distillation losses to improve performance on novel classes.


Learning the 3D Fauna of the Web

Zizhang Li,Dor Litvak,Ruining Li,Yunzhi Zhang,Tomas Jakab,Christian Rupprecht,Shangzhe Wu,Andrea Vedaldi,Jiajun Wu

http://arxiv.org/abs/2401.02400v1

Compressor summary: The authors propose 3D-Fauna, a method that learns a deformable 3D animal model from 2D images and the Semantic Bank of Skinned Models to handle rare species with limited data.


Generating synthetic data for neural operators

Erisa Hasani,Rachel A. Ward

http://arxiv.org/abs/2401.02398v1

Compressor summary: The paper proposes a new method to create synthetic training data for deep learning neural operators without using classical numerical solvers for partial differential equations (PDEs).
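The core trick — differentiation is cheap even when solving is expensive — can be illustrated for the 1D Poisson problem: sample a random solution first, then apply the differential operator to it analytically to obtain the forcing term, yielding an (input, solution) training pair with no solver in the loop. A sketch under assumed details (the sine-series parameterization and the specific PDE are illustrative choices, not the paper's setup):

```python
import numpy as np

def synth_pair(n_modes=3, n_grid=64, rng=None):
    # For -u'' = f on [0, pi] with zero boundary values: sample u as a
    # random sine series, then obtain f = -u'' analytically, since
    # (sin(kx))'' = -k^2 sin(kx). No numerical PDE solve is needed.
    rng = np.random.default_rng(rng)
    x = np.linspace(0.0, np.pi, n_grid)
    coeffs = rng.normal(size=n_modes)
    k = np.arange(1, n_modes + 1)
    basis = np.sin(np.outer(k, x))                       # shape (n_modes, n_grid)
    u = (coeffs[:, None] * basis).sum(axis=0)            # random "solution"
    f = (coeffs[:, None] * (k ** 2)[:, None] * basis).sum(axis=0)  # forcing term
    return f, u   # one (input f, solution u) training pair for a neural operator
```

Sampling many such pairs gives a supervised dataset mapping forcing terms to solutions, which is exactly what a neural operator consumes.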


TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang,Guangtao Zeng,Tianduo Wang,Wei Lu

http://arxiv.org/abs/2401.02385v1

Compressor summary: TinyLlama is a small, efficient language model that performs well on various tasks and is available for free on GitHub.


ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

Fanqing Meng,Wenqi Shao,Quanfeng Lu,Peng Gao,Kaipeng Zhang,Yu Qiao,Ping Luo

http://arxiv.org/abs/2401.02384v1

Compressor summary: The text introduces ChartAssistant, a vision-language model for universal chart comprehension and reasoning, which outperforms existing methods without task-specific fine-tuning.


Survey of 3D Human Body Pose and Shape Estimation Methods for Contemporary Dance Applications

Darshan Venkatrayappa,Alain Tremeau,Damien Muselet,Philippe Colantoni

http://arxiv.org/abs/2401.02383v1

Compressor summary: The study compares 3D body shape and pose estimation methods for contemporary dance, showing that multi-frame methods perform better than single-frame methods.


SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval

Griffin Adams,Jason Zucker,Noémie Elhadad

http://arxiv.org/abs/2401.02369v1

Compressor summary: The authors fine-tune LLMs to generate hospital discharge summaries, using a smaller encoder model to predict salient entities and sentence-level planning via embedded entity retrieval (SPEER) to improve coverage and faithfulness.


Integration of physics-informed operator learning and finite element method for parametric learning of partial differential equations

Shahed Rezaei,Ahmad Moeineddin,Michael Kaliske,Markus Apel

http://arxiv.org/abs/2401.02363v1

Compressor summary: The paper proposes a physics-informed deep learning method that solves steady-state heat equations in heterogeneous solids faster and more accurately than classical methods or pure data-driven approaches.


An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Xiangyu Zhao,Yicheng Chen,Shilin Xu,Xiangtai Li,Xinjiang Wang,Yining Li,Haian Huang

http://arxiv.org/abs/2401.02361v1

Compressor summary: MM-Grounding-DINO is an open-source version of Grounding-DINO, a state-of-the-art model for multiple vision tasks, with comprehensive technical details and better performance.


Fit-NGP: Fitting Object Models to Neural Graphics Primitives

Marwan Taher,Ignacio Alzugaray,Andrew J. Davison

http://arxiv.org/abs/2401.02357v1

Compressor summary: The paper presents a system that uses a density field from an efficient radiance field method to accurately and robustly estimate the pose of small, challenging objects with reflective surfaces using a single wrist-mounted camera on a robot arm.


A Survey Analyzing Generalization in Deep Reinforcement Learning

Ezgi Korkmaz

http://arxiv.org/abs/2401.02349v1

Compressor summary: The paper discusses the challenges of overfitting and generalization in deep reinforcement learning, and proposes a unified framework to improve robustness.


Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Longtian Qiu,Shan Ning,Xuming He

http://arxiv.org/abs/2401.02347v1

Compressor summary: The paper proposes a text-only trained zero-shot image captioning framework that leverages subregion features and reduces the modality gap between images and texts using noise injection and CLIP reranking, improving performance on common datasets.
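The noise-injection idea behind text-only captioner training can be sketched as follows: perturb the CLIP text embedding with Gaussian noise and re-normalize, so the decoder learns to tolerate the gap between text and image embeddings at inference time. The sigma value and the exact scheme here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def inject_noise(text_embedding, sigma=0.1, rng=None):
    # Add zero-mean Gaussian noise to an (L2-normalized) text embedding,
    # then re-normalize back onto the unit sphere — a common trick to
    # mimic the image-text modality gap when training on text alone.
    rng = np.random.default_rng(rng)
    noisy = text_embedding + rng.normal(0.0, sigma, size=text_embedding.shape)
    return noisy / np.linalg.norm(noisy)
```

At training time the captioner decodes from `inject_noise(text_emb)`; at test time it decodes from the image embedding directly.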


Linguistic Profiling of Deepfakes: An Open Database for Next-Generation Deepfake Detection

Yabin Wang,Zhiwu Huang,Zhiheng Ma,Xiaopeng Hong

http://arxiv.org/abs/2401.02335v1

Compressor summary: The paper introduces DFLIP-3K, a large and diverse deepfake database that enables the development of convincing and explainable deepfake detection methods through linguistic profiling of deepfakes.


Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models

Uday Allu,Biddwan Ahmed,Vishesh Tripathi

http://arxiv.org/abs/2401.02333v1

Compressor summary: The paper proposes an improved method for finding accurate information from complex tables in PDF documents using RAG-based systems and various natural language processing techniques.


LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model

Yichen Zhu,Minjie Zhu,Ning Liu,Zhicai Ou,Xiaofeng Mou,Jian Tang

http://arxiv.org/abs/2401.02330v1

Compressor summary: LLaVA-Phi is a small but powerful multi-modal assistant that uses a tiny language model to interact effectively with both text and images, making it suitable for real-time applications and resource-efficient systems.


ClassWise-SAM-Adapter: Parameter Efficient Fine-tuning Adapts Segment Anything to SAR Domain for Semantic Segmentation

Xinyang Pu,Hecheng Jia,Linghao Zheng,Feng Wang,Feng Xu

http://arxiv.org/abs/2401.02326v1

Compressor summary: The ClassWise-SAM-Adapter (CWSAM) adapts a large vision foundation model to classify land cover in SAR images, using fewer resources while outperforming conventional methods.


A Robust Quantile Huber Loss With Interpretable Parameter Adjustment In Distributional Reinforcement Learning

Parvin Malekzadeh,Konstantinos N. Plataniotis,Zissis Poulos,Zeyu Wang

http://arxiv.org/abs/2401.02325v1

Compressor summary: The paper proposes a new quantile Huber loss function, derived from the Wasserstein distance, that improves robustness and generalization in distributional reinforcement learning by accounting for noise in both predicted and target quantiles.
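For context, the classical quantile Huber loss that the paper generalizes combines an asymmetric quantile weight with the Huber loss. The sketch below follows the standard QR-DQN formulation; the paper's Wasserstein-based variant is not reproduced here:

```python
import numpy as np

def huber(u, kappa=1.0):
    # Huber loss: quadratic near zero, linear in the tails.
    return np.where(np.abs(u) <= kappa,
                    0.5 * u ** 2,
                    kappa * (np.abs(u) - 0.5 * kappa))

def quantile_huber(pred, target, tau, kappa=1.0):
    # Asymmetric weighting by |tau - 1{u < 0}|: under-estimates and
    # over-estimates of the tau-quantile are penalized differently.
    u = target - pred
    return np.abs(tau - (u < 0).astype(float)) * huber(u, kappa) / kappa
```

At tau = 0.5 the loss is symmetric; higher tau penalizes under-estimation more, which is what makes it a quantile regression loss.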


BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

Yiran Song,Qianyu Zhou,Xiangtai Li,Deng-Ping Fan,Xuequan Lu,Lizhuang Ma

http://arxiv.org/abs/2401.02317v1

Compressor summary: The paper presents BA-SAM, a method that enhances SAM's adaptability to varying image resolutions by adjusting the attention layer and prioritizing neighboring information, leading to better or state-of-the-art segmentation results in zero-shot and fine-tuning settings.
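As background, a distance-based additive attention bias — one common way to make attention "prioritize neighboring information" — can be sketched as below; the bias form and scaling are illustrative, not the paper's exact design:

```python
import numpy as np

def biased_attention(q, k, v, bias_scale=0.1):
    # Scaled dot-product attention with an additive bias that penalizes
    # attention logits in proportion to token distance, so each token
    # favors its neighbors.
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    logits = logits - bias_scale * dist        # down-weight far-away tokens
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ v
```

As `bias_scale` grows, each token attends almost exclusively to itself, so the output approaches `v`.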


SuperEdge: Towards a Generalization Model for Self-Supervised Edge Detection

Leng Kai,Zhang Zhijie,Liu Jie,Zed Boukhers,Sui Wei,Cong Yang,Li Zhijun

http://arxiv.org/abs/2401.02313v1

Compressor summary: This paper introduces a self-supervised approach for edge detection using multi-level, multi-homography techniques to transfer annotations from synthetic to real data, and SuperEdge, a model that extracts edges at pixel and object levels.


TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

Hao Sun,Mingyao Zhou,Wenjing Chen,Wei Xie

http://arxiv.org/abs/2401.02309v1

Compressor summary: The paper proposes a new method, TR-DETR, that leverages the reciprocal relationship between video moment retrieval and highlight detection tasks to improve performance on these related tasks using a local-global multi-modal alignment module and a task cooperation module.


Robust Physics Informed Neural Networks

Marcin Łoś,Maciej Paszyński

http://arxiv.org/abs/2401.02300v1

Compressor summary: The authors propose a Robust Physics-Informed Neural Network (RPINN) that improves the loss function in standard PINNs by incorporating the residual and the inverse of the Gram matrix, resulting in a more accurate approximation of PDE solutions with better convergence.


Are LLMs Robust for Spoken Dialogues?

Seyed Mahed Mousavi,Gabriel Roccabruna,Simone Alghisi,Massimo Rizzoli,Mirco Ravanelli,Giuseppe Riccardi

http://arxiv.org/abs/2401.02297v1

Compressor summary: This paper evaluates the performance of large language models on spoken task-oriented dialogues and finds that they are not robust to noise by default, but can be improved with fine-tuning.


Training Single-Layer Morphological Perceptron Using Convex-Concave Programming

Iara Cunha,Marcos Eduardo Valle

http://arxiv.org/abs/2401.02296v1

Compressor summary: The paper presents a new training algorithm for single-layer morphological perceptrons that combines two existing methods and shows its effectiveness in binary classification problems.


GridFormer: Point-Grid Transformer for Surface Reconstruction

Shengtao Li,Ge Gao,Yudong Liu,Yu-Shen Liu,Ming Gu

http://arxiv.org/abs/2401.02292v1

Compressor summary: The paper introduces GridFormer, a method that uses an attention mechanism between grid and point features for efficient 3D surface reconstruction with high precision.


Path-based Explanation for Knowledge Graph Completion

Heng Chang,Jiangnan Ye,Alejo Lopez Avila,Jinhua Du,Jia Li

http://arxiv.org/abs/2401.02290v1

Compressor summary: Power-Link is a novel path-based explainer for Knowledge Graph Completion models that uses a simplified graph-powering technique to generate interpretable explanations efficiently and scalably.


Distillation-based fabric anomaly detection

Simon Thomine,Hichem Snoussi

http://arxiv.org/abs/2401.02287v1

Compressor summary: The text proposes a fast and robust unsupervised fabric defect detection method using reverse knowledge distillation, which avoids reconstructing anomalies and mitigates classifier bias across different types of fabrics and defects.


PEGASUS: Physically Enhanced Gaussian Splatting Simulation System for 6DOF Object Pose Dataset Generation

Lukas Meyer,Floris Erich,Yusuke Yoshiyasu,Marc Stamminger,Noriaki Ando,Yukiyasu Domae

http://arxiv.org/abs/2401.02281v1

Compressor summary: PEGASUS is a versatile dataset generator that creates realistic scenes by combining environments and objects using 3D Gaussian Splatting and physics simulation, enabling pose estimation networks to transfer from synthetic to real-world data.


Lightweight Fish Classification Model for Sustainable Marine Management: Indonesian Case

Febrian Kurniawan,Gandeva Bayu Satrya,Firuz Kamalov

http://arxiv.org/abs/2401.02278v1

Compressor summary: The study develops a machine learning model that identifies fish species and whether they are edible, using a lightweight modified MobileNet that runs on limited hardware. Trained on a large dataset of Indonesian fish images, the model could aid sustainable fishing and conserve marine life.


Universal Approximation Theorem for Vector- and Hypercomplex-Valued Neural Networks

Marcos Eduardo Valle,Wington L. Vital,Guilherme Vieira

http://arxiv.org/abs/2401.02277v1

Compressor summary: The paper extends the universal approximation theorem to various vector-valued neural networks, including hypercomplex ones, by defining non-degenerate algebras and applying it to neural networks on these algebras.


ShapeAug: Occlusion Augmentation for Event Camera Data

Katharina Bendig,René Schuster,Didier Stricker

http://arxiv.org/abs/2401.02274v1

Compressor summary: The paper proposes a new event data augmentation method for DVS classification and object detection that introduces synthetic events for moving objects and improves accuracy.


Uncertainty-Aware Deep Attention Recurrent Neural Network for Heterogeneous Time Series Imputation

Linglong Qian,Zina Ibrahim,Richard Dobson

http://arxiv.org/abs/2401.02258v1

Compressor summary: DEARI is a deep recurrent neural network that jointly imputes missing values and their uncertainty in heterogeneous multivariate time series, outperforming the current state-of-the-art methods.


Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems

Yuma Tsuta,Naoki Yoshinaga,Shoetsu Sato,Masashi Toyoda

http://arxiv.org/abs/2401.02256v1

Compressor summary: The study examines how to create an automatic response evaluator for open-domain dialogue systems that considers the human perspective and interlocutor awareness.


Balancing Continual Learning and Fine-tuning for Human Activity Recognition

Chi Ian Tang,Lorena Qendro,Dimitris Spathis,Fahim Kawsar,Akhil Mathur,Cecilia Mascolo

http://arxiv.org/abs/2401.02255v1

Compressor summary: The paper proposes two models for wearable-based human activity recognition using continual learning, comparing them with existing approaches and exploring the balance between retention and adaptation.


L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi,Srushti Sonavane,Purva Lingayat,Tejas Padhiyar,Raviraj Joshi

http://arxiv.org/abs/2401.02254v1

Compressor summary: L3Cube-IndicNews is a multilingual text classification corpus for Indian regional languages, covering news headlines, articles, and sub-articles in 10 languages, with evaluation using different models.


Policy-regularized Offline Multi-objective Reinforcement Learning

Qian Lin,Chao Yu,Zongkai Liu,Zifan Wu

http://arxiv.org/abs/2401.02244v1

Compressor summary: The paper proposes a method to train a policy for multi-objective RL using only offline trajectory data, and addresses the preference-inconsistent demonstration problem by filtering, regularizing, and adapting the policy.


Slot-guided Volumetric Object Radiance Fields

Di Qi,Tong Yang,Xiangyu Zhang

http://arxiv.org/abs/2401.02241v1

Compressor summary: sVORF is a novel unsupervised method for decomposing complex scenes into individual objects from a single image using volumetric object radiance fields and object slots guided by transformers.


U-Mixer: An Unet-Mixer Architecture with Stationarity Correction for Time Series Forecasting

Xiang Ma,Xuemei Li,Lexin Fang,Tianlong Zhao,Caiming Zhang

http://arxiv.org/abs/2401.02236v1

Compressor summary: U-Mixer is a framework that combines Unet and Mixer to tackle non-stationarity in time series forecasting by correcting stationarity and preserving temporal dependencies, achieving better performance than SOTA methods.


Trajectory-Oriented Policy Optimization with Sparse Rewards

Guojian Wang,Faguo Wu,Xiao Zhang

http://arxiv.org/abs/2401.02225v1

Compressor summary: The text introduces a new DRL method that uses offline demonstrations as guidance to learn policies faster in tasks with sparse rewards, using a novel trajectory distance based on MMD for optimization.


Joint Multi-Facts Reasoning Network For Complex Temporal Question Answering Over Knowledge Graph

Rikui Huang,Wei Wei,Xiaoye Qu,Wenfeng Xie,Xianling Mao,Dangyang Chen

http://arxiv.org/abs/2401.02212v1

Compressor summary: JMFRN is a model that jointly reasons about multiple temporal facts from a knowledge graph to answer complex questions, using entity-aware and time-aware attention modules and an answer type discrimination task.


DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models

Songbo Hu,Xiaobin Wang,Zhangdie Yuan,Anna Korhonen,Ivan Vulić

http://arxiv.org/abs/2401.02208v1

Compressor summary: DIALIGHT is an open-source toolkit for developing and evaluating multilingual task-oriented dialogue systems using pretrained and large language models, with a focus on systematic comparisons and user feedback.


Location Aware Modular Biencoder for Tourism Question Answering

Haonan Li,Martin Tomko,Timothy Baldwin

http://arxiv.org/abs/2401.02187v1

Compressor summary: The paper presents an efficient and effective method for tourism QA using dense vector retrieval, which encodes questions and POIs separately with pretrained language models and a location encoder, outperforms previous methods on a real-world dataset, and enables global evaluation with a larger search space.
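The biencoder arrangement described in this summary can be sketched in a few lines: because questions and POIs are encoded separately, the POI vectors can be precomputed once, and relevance reduces to a dot product over a large candidate space (the embedding models and location encoder are assumed away here):

```python
import numpy as np

def retrieve(question_vec, poi_vecs, top_k=3):
    # Biencoder retrieval: relevance is a dot product between the query
    # embedding and precomputed POI embeddings, so search scales to a
    # large candidate set without re-encoding anything per query.
    scores = poi_vecs @ question_vec
    return np.argsort(-scores)[:top_k].tolist()
```

The same precomputed matrix serves every incoming question, which is what makes the approach efficient.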


FairGridSearch: A Framework to Compare Fairness-Enhancing Models

Shih-Chi Ma,Tatiana Ermakova,Benjamin Fabian

http://arxiv.org/abs/2401.02183v1

Compressor summary: FairGridSearch is a framework for comparing fairness-enhancing models in binary classification, considering various factors such as metric selection, base estimator choice, and classification threshold, which can affect model fairness differently across datasets.


Prompt Decoupling for Text-to-Image Person Re-identification

Weihao Li,Lei Tan,Pingyang Dai,Yan Zhang

http://arxiv.org/abs/2401.02173v1

Compressor summary: The paper proposes a two-stage training approach with prompt tuning strategy to improve text-to-image person re-identification using CLIP model by decoupling domain adaptation and task adaptation.


Frequency Domain Nuances Mining for Visible-Infrared Person Re-identification

Yukang Zhang,Yang Lu,Yan Yan,Hanzi Wang,Xuelong Li

http://arxiv.org/abs/2401.02162v1

Compressor summary: This paper proposes a novel method to reduce modality discrepancy in visible-infrared person re-identification by exploring frequency domain information, which improves performance on two datasets.


Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain

Xuanhua He,Tao Hu,Guoli Wang,Zejin Wang,Run Wang,Qian Zhang,Keyu Yan,Ziyi Chen,Rui Li,Chenjun Xie,Jie Zhang,Man Zhou

http://arxiv.org/abs/2401.02161v1

Compressor summary: The authors propose FourierISP, a novel neural network framework that enhances both color and structure of smartphone RAW images by separating them in the frequency domain and using three subnetworks for optimization.


Shayona@SMM4H23: COVID-19 Self diagnosis classification using BERT and LightGBM models

Rushi Chavda,Darshan Makwana,Vraj Patel,Anupam Shukla

http://arxiv.org/abs/2401.02158v1

Compressor summary: The paper reports on Team Shayona's success in two shared tasks involving binary classification of COVID-19 tweets and social anxiety Reddit posts using BERT and LightGBM models.


Disentangle Estimation of Causal Effects from Cross-Silo Data

Yuxuan Liu,Haozhao Wang,Shuang Wang,Zhiming He,Wenchao Xu,Jialiang Zhu,Fan Yang

http://arxiv.org/abs/2401.02154v1

Compressor summary: The text proposes a method that estimates causal effects among events across private data silos using a disentangled architecture with shared and private branches and global constraints, improving the accuracy of causal effect estimation over existing methods.


Frequency-Adaptive Pan-Sharpening with Mixture of Experts

Xuanhua He,Keyu Yan,Rui Li,Chengjun Xie,Jie Zhang,Man Zhou

http://arxiv.org/abs/2401.02151v1

Compressor summary: The FAME learning framework is a novel method for pan-sharpening that uses frequency domain techniques to reconstruct missing high-frequency information in multi-spectral images, outperforming existing methods.


Marginal Debiased Network for Fair Visual Recognition

Mei Wang,Weihong Deng,Sen Su

http://arxiv.org/abs/2401.02150v1

Compressor summary: The paper proposes a novel network to learn debiased representations by using a marginal softmax loss that penalizes spurious correlations and adapts margin parameters through meta learning.


Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study

Ziqiang Zheng,Yiwei Chen,Jipeng Zhang,Tuan-Anh Vu,Huimin Zeng,Yue Him Wong Tim,Sai-Kit Yeung

http://arxiv.org/abs/2401.02147v1

Compressor summary: The study evaluates GPT-4V's performance on marine analysis tasks and finds it lacking domain-specific knowledge, setting a new standard for future MLLM developments.


Graph Neural Networks for Tabular Data Learning: A Survey with Taxonomy and Directions

Cheng-Te Li,Yu-Che Tsai,Chih-Yao Chen,Jay Chiehen Liao

http://arxiv.org/abs/2401.02143v1

Compressor summary: The text surveys Graph Neural Networks (GNNs) for Tabular Data Learning (TDL), highlighting their strengths, challenges, applications, and future directions.


GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

Xuehao Gao,Yang Yang,Zhenyu Xie,Shaoyi Du,Zhongqian Sun,Yang Wu

http://arxiv.org/abs/2401.02142v1

Compressor summary: The paper introduces GUESS, a generative framework that synthesizes human motion from text descriptions using a cascaded latent diffusion model and a multi-condition fusion mechanism to improve accuracy, realism, and diversity.


Bayesian Intrinsic Groupwise Image Registration: Unsupervised Disentanglement of Anatomy and Geometry

Xinzhe Luo,Xin Wang,Linda Shapiro,Chun Yuan,Jianfeng Feng,Xiahai Zhuang

http://arxiv.org/abs/2401.02141v1

Compressor summary: The article presents a Bayesian learning framework that uses hierarchical variational auto-encoding to perform multi-modal groupwise registration on medical images, achieving superior accuracy, efficiency, scalability, and interpretability without complex similarity measures.


Explore Human Parsing Modality for Action Recognition

Jinfu Liu,Runwei Ding,Yuhang Wen,Nan Dai,Fanyang Meng,Shen Zhao,Mengyuan Liu

http://arxiv.org/abs/2401.02138v1

Compressor summary: The paper introduces a new dual-branch framework called EPP-Net that uses both skeletons and human parsing features for action recognition, improving performance over existing methods.


SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma,Furong Xu,Jian Liu,Ming Yang,Qingpei Guo

http://arxiv.org/abs/2401.02137v1

Compressor summary: SyCoCa is a multimodal alignment method that improves contrastive captioning by introducing bidirectional interactions between images and texts at both global and local levels using textual and visual cues.


DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

Wendi Cui,Jiaxin Zhang,Zhuohang Li,Lopez Damien,Kamalika Das,Bradley Malin,Sricharan Kumar

http://arxiv.org/abs/2401.02132v1

Compressor summary: The paper proposes DCR, a framework to evaluate and improve the consistency of text generated by LLMs using divide-conquer-reasoning, which outperforms existing methods in multiple tasks and reduces hallucination issues.


Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

Jiacheng Wang,Ping Liu,Wei Xu

http://arxiv.org/abs/2401.02126v1

Compressor summary: The paper presents a text-to-image editing framework that can perform both rigid and non-rigid edits guided by text or reference images, using dual-path injection, self-attention, and latent fusion to achieve high quality and flexibility.


PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques

Tzu-Han Lin,How-Shing Wang,Hao-Yung Weng,Kuang-Chen Peng,Zih-Ching Chen,Hung-yi Lee

http://arxiv.org/abs/2401.02122v1

Compressor summary: PEFT methods have varying effects on speech processing, and using an ensemble approach with majority voting improves performance over DARTS and a baseline method.
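The majority-voting ensemble mentioned above is simple to sketch: each PEFT variant produces a label sequence, and the ensemble keeps the most common label at every position (a generic sketch, not the paper's exact pipeline):

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one label list per PEFT variant, all the same length.
    # Returns the most common label at each position across variants.
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]
```

With an odd number of variants, ties cannot occur for binary labels, which keeps the vote well-defined.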


Using LLM to select the right SQL Query from candidates

Zhenwen Li,Tao Xie

http://arxiv.org/abs/2401.02115v1

Compressor summary: The paper proposes a text-to-SQL re-rank method that generates databases and uses LLMs to predict execution results as test cases, then selects the best SQL query from the candidates based on pass counts and generation probabilities, improving some state-of-the-art models by 3.6%.
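The selection step this summary describes can be sketched as a two-key ranking: count how many LLM-predicted test cases each candidate query passes, and break ties with the generation probability. The data shapes and the `execute` callable below are illustrative assumptions, not the paper's interfaces:

```python
def rerank_sql(candidates, test_cases, execute):
    # candidates: list of (sql_text, log_prob) pairs from the text-to-SQL model.
    # test_cases: list of (database, expected_result) pairs predicted by an LLM.
    # execute:    callable(sql, database) -> execution result.
    def score(cand):
        sql, log_prob = cand
        passed = sum(execute(sql, db) == expected
                     for db, expected in test_cases)
        return (passed, log_prob)   # pass count first, probability as tie-break
    return max(candidates, key=score)
```

Tuple comparison makes `max` prefer more passes first and, among equally passing queries, the one the model was most confident about.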


Source-Free Online Domain Adaptive Semantic Segmentation of Satellite Images under Image Degradation

Fahim Faisal Niloy,Kishor Kumar Bhaumik,Simon S. Woo

http://arxiv.org/abs/2401.02113v1

Compressor summary: The paper presents a fast, lightweight, backpropagation-free test-time adaptation method for satellite image segmentation that adapts to distribution shifts by estimating global Batch Normalization statistics and refining predicted masks using global class centers.


Significance of Anatomical Constraints in Virtual Try-On

Debapriya Roy,Sanchayan Santra,Diganta Mukherjee,Bhabatosh Chanda

http://arxiv.org/abs/2401.02110v1

Compressor summary: The paper presents a new system for virtual try-on (VTON) of clothes, addressing the limitations of existing methods by using anatomy-aware geometric transformations and part-based warping.


CLAPP: Contrastive Language-Audio Pre-training in Passive Underwater Vessel Classification

Zeyu Li,Jingsheng Gao,Tong Yu,Suncheng Xiang,Jiacheng Ruan,Ting Liu,Yuzhuo Fu

http://arxiv.org/abs/2401.02099v1

Compressor summary: The text introduces CLAPP, a novel model that uses contrastive language-audio pre-training to improve underwater vessel classification and recognition from raw audio data and vessel state text pairs.


Preserving Image Properties Through Initializations in Diffusion Models

Jeffrey Zhang,Shao-Yu Chang,Kedan Li,David Forsyth

http://arxiv.org/abs/2401.02097v1

Compressor summary: The paper discusses how to improve retail photography using Stable Diffusion methods by addressing inconsistencies in image generation and backgrounds.


Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe

Mincong Huang,Chao Wang,Chi Ma,Yineng Zhang,Peng Zhang,Lei Yu

http://arxiv.org/abs/2401.02088v1

Compressor summary: BPipe improves memory utilization for large Transformer models like GPT-3, but not for LLaMA, and its benefits depend on flash attention.


View-based Explanations for Graph Neural Networks

Tingyang Chen,Dazhuo Qiu,Yinghui Wu,Arijit Khan,Xiangyu Ke,Yunjun Gao

http://arxiv.org/abs/2401.02086v1

Compressor summary: The paper proposes GVEX, a method that generates graph views as explanations, helping users understand the class labels that graph neural networks (GNNs) assign in analytical tasks like graph classification.


Energy based diffusion generator for efficient sampling of Boltzmann distributions

Yan Wang,Ling Guo,Hao Wu,Tao Zhou

http://arxiv.org/abs/2401.02080v1

Compressor summary: The energy-based diffusion generator is a new sampler that uses a variational autoencoder with a diffusion model encoder and a generalized Hamiltonian dynamics decoder to generate samples from arbitrary target distributions, outperforming existing methods.


Leveraging SAM for Single-Source Domain Generalization in Medical Image Segmentation

Hanhui Wang,Huaize Ye,Yi Xia,Xueyan Zhang

http://arxiv.org/abs/2401.02076v1

Compressor summary: The paper proposes an improved single-source domain generalization (SDG) method for medical image segmentation using a parallel framework with the Segment Anything Model (SAM).


ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers

Chen Zheng,Ke Sun,Da Tang,Yukun Ma,Yuyu Zhang,Chenguang Xi,Xun Zhou

http://arxiv.org/abs/2401.02072v1

Compressor summary: ICE-GRT is a new AI model that uses Reinforcement Learning from Human Feedback to improve its understanding and reasoning abilities in domain-specific tasks without sacrificing general task performance, outperforming other large language models.


Neural Collapse for Cross-entropy Class-Imbalanced Learning with Unconstrained ReLU Feature Model

Hien Dang,Tho Tran,Tan Nguyen,Nhat Ho

http://arxiv.org/abs/2401.02058v1

Compressor summary: The paper studies how neural collapse, a phenomenon where a deep network's last-layer class means converge to the vertices of a simplex equiangular tight frame (ETF), changes when the training data is imbalanced between classes.
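
For concreteness, the simplex geometry referred to above can be constructed directly: with K balanced classes, the neural-collapse limit has class means forming K unit vectors whose pairwise cosine similarity is exactly -1/(K-1). The sketch below builds the standard simplex ETF (this is the textbook construction, not code from the paper); the paper's focus is how class imbalance distorts this structure:

```python
import numpy as np

def simplex_etf(K):
    """Rows are K unit vectors with pairwise inner product -1/(K-1):
    the simplex equiangular tight frame that last-layer class means
    approach under neural collapse with balanced classes."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
```

The Gram matrix of the rows has ones on the diagonal and -1/(K-1) everywhere else, so all class means are equally far apart, i.e. maximally separated on the sphere.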


Generalizable vision-language pre-training for annotation-free pathology localization

Hao Yang,Hong-Yu Zhou,Cheng Li,Weijian Huang,Jiarun Liu,Shanshan Wang

http://arxiv.org/abs/2401.02044v1

Compressor summary: AFLoc is a new model that can find diseases in medical images without expert annotations, using multi-level learning and image-text alignment to adapt to diverse pathologies and outperform existing methods.


Efficient Cloud-edge Collaborative Inference for Object Re-identification

Chuanming Wang,Yuxin Yang,Mengshi Qi,Huadong Ma

http://arxiv.org/abs/2401.02041v1

Compressor summary: Noting that current centralized ReID systems are impractical for large numbers of videos, the paper proposes a cloud-edge collaborative inference framework with DaCM, a model that exploits spatial-temporal correlations among instances to reduce transmission overhead and improve efficiency and scalability.


Understanding LLMs: A Comprehensive Overview from Training to Inference

Yiheng Liu,Hao He,Tianle Han,Xu Zhang,Mengyuan Liu,Jiaming Tian,Yutong Zhang,Jiaqi Wang,Xiaohui Gao,Tianyang Zhong,Yi Pan,Shaochen Xu,Zihao Wu,Zhengliang Liu,Xin Zhang,Shu Zhang,Xintao Hu,Tuo Zhang,Ning Qiang,Tianming Liu,Bao Ge

http://arxiv.org/abs/2401.02038v1

Compressor summary: This paper surveys Large Language Models from training to inference, noting how ChatGPT's introduction boosted their use for downstream tasks and emphasizing the evolution of cost-efficient training and deployment techniques.


Text2MDT: Extracting Medical Decision Trees from Medical Texts

Wei Zhu,Wenfeng Li,Xing Tian,Pengfei Wang,Xiaoling Wang,Jin Chen,Yuanbin Wu,Yuan Ni,Guotong Xie

http://arxiv.org/abs/2401.02034v1

Compressor summary: The paper introduces Text2MDT, a novel task of automatically extracting medical decision trees from medical texts, along with an annotated Chinese dataset, and evaluates two approaches: an end-to-end method using LLMs, which shows promising results (especially with chain-of-thought prompting), and a pipeline method using encoder-based models, which performs comparably.


DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

Yunfan Ye,Kai Xu,Yuhang Huang,Renjiao Yi,Zhiping Cai

http://arxiv.org/abs/2401.02032v1

Compressor summary: The DiffusionEdge model uses a diffusion probabilistic approach to improve edge detection accuracy and sharpness, achieving superior results on various datasets.


Spy-Watermark: Robust Invisible Watermarking for Backdoor Attack

Ruofei Wang,Renjie Wan,Zongyu Guo,Qing Guo,Rui Huang

http://arxiv.org/abs/2401.02031v1

Compressor summary: Spy-Watermark is a novel backdoor attack method that uses a learnable watermark embedded in images to deceive victim models while resisting data corruption and defense measures.
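
As a generic illustration of the trigger-embedding step in watermark-based backdoors (not Spy-Watermark's learned procedure), an invisible trigger can be blended into an image at low amplitude; the blending weight and names here are assumptions:

```python
import numpy as np

def embed_trigger(image, watermark, alpha=0.01):
    """Add a low-amplitude watermark pattern to an image so the
    backdoor trigger stays visually imperceptible.

    image, watermark: arrays with values in [0, 1].
    """
    return np.clip(image + alpha * watermark, 0.0, 1.0)
```

Spy-Watermark's contribution is making such a trigger learnable and robust to data corruption and defenses, rather than a fixed additive pattern like this one.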


From Function to Distribution Modeling: A PAC-Generative Approach to Offline Optimization

Qiang Zhang,Ruida Zhou,Yang Shen,Tie Liu

http://arxiv.org/abs/2401.02019v1

Compressor summary: The paper proposes a new approach to offline optimization that views it as sampling from a generative model and uses re-weighting with a PAC lower bound to learn a weight function and a score-based generative model, achieving robustly competitive performance on benchmarks.


Improving Diffusion-Based Image Synthesis with Context Prediction

Ling Yang,Jingwei Liu,Shenda Hong,Zhilong Zhang,Zhilin Huang,Zheming Cai,Wentao Zhang,Bin Cui

http://arxiv.org/abs/2401.02015v1

Compressor summary: ConPreDiff is a diffusion model that predicts neighborhood context for better image synthesis using a context decoder during training and removing it for inference.


SwitchTab: Switched Autoencoders Are Effective Tabular Learners

Jing Wu,Suiyao Chen,Qi Zhao,Renat Sergazinov,Chen Li,Shengjie Liu,Chongchao Zhao,Tianpei Xie,Hanqing Guo,Cheng Ji,Daniel Cociorva,Hakan Brunzel

http://arxiv.org/abs/2401.02013v1

Compressor summary: SwitchTab is a novel self-supervised method for tabular data that captures latent dependencies and improves downstream tasks by producing more representative embeddings and enhancing traditional classification methods.


Fast & Fair: Efficient Second-Order Robust Optimization for Fairness in Machine Learning

Allen Minch,Hung Anh Vu,Anne Marie Warren

http://arxiv.org/abs/2401.02012v1

Compressor summary: The project aims to build fairer AI models by combining efficient second-order robust optimization with adversarial training techniques that address the bias inherent in Deep Neural Networks.


Decentralized Multi-Task Online Convex Optimization Under Random Link Failures

Wenjing Yan,Xuanyu Cao

http://arxiv.org/abs/2401.02011v1

Compressor summary: The paper proposes a robust decentralized saddle-point algorithm for multi-task optimization with random link failures, achieving regret and constraint violation bounds matching those of perfect communication scenarios.


Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

Wenqi Zhang,Yongliang Shen,Linjuan Wu,Qiuying Peng,Jun Wang,Yueting Zhuang,Weiming Lu

http://arxiv.org/abs/2401.02009v1

Compressor summary: Self-Contrast improves Large Language Model's reflection by exploring diverse perspectives, contrasting differences, and summarizing them into a checklist for re-evaluation and error reduction.


Two-Stage Surrogate Modeling for Data-Driven Design Optimization with Application to Composite Microstructure Generation

Farhad Pourkamali-Anaraki,Jamal F. Husseini,Evan J. Pineda,Brett A. Bednarcyk,Scott E. Stapleton

http://arxiv.org/abs/2401.02008v1

Compressor summary: The paper presents a new two-stage machine learning framework that solves inverse problems by identifying promising input designs and evaluating them efficiently using conformal inference.