This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-06-12 generated by the compressor, my personal LLM-based project.
http://arxiv.org/abs/2406.07550v1
Compressor summary: TiTok is a Transformer-based 1D tokenizer that efficiently encodes images into latent sequences, achieving competitive performance and significant speedup compared to conventional 2D tokenization methods.
http://arxiv.org/abs/2406.07551v1
Compressor summary: BSSTNet is a video deblurring method that uses spatio-temporal sparse transformers, blur maps, and bidirectional feature propagation to improve performance on challenging blurry video frames.
http://arxiv.org/abs/2406.07548v1
Compressor summary: The paper introduces a new transformer-based tokenizer that compresses images and videos to a hypersphere using binary quantization, achieving high reconstruction quality and competitive compression results.
http://arxiv.org/abs/2406.07547v1
Compressor summary: Imitative editing allows users to edit images by drawing inspiration from other pictures, and MimicBrush is a generative model that learns to recover masked regions using semantic correspondence between separate images.
http://arxiv.org/abs/2406.07546v1
Compressor summary: The paper introduces a new task and benchmark called Commonsense-T2I that evaluates text-to-image models' ability to produce images that fit commonsense reasoning in real life scenarios using adversarial text prompts.
http://arxiv.org/abs/2406.07545v1
Compressor summary: The text discusses the issues of selection bias and random guessing in multiple-choice questions for evaluating large language models and proposes a new benchmark using open-style questions to better assess their capabilities.
http://arxiv.org/abs/2406.07544v1
Compressor summary: SIG3D is a model that enables robots to understand 3D scenes from language and answer questions based on their position in the scene.
http://arxiv.org/abs/2406.07543v1
Compressor summary: Key points: - Vision model pre-training uses web-crawled image-text data - Latent Compression Learning (LCL) exploits interleaved image-text data - LCL maximizes mutual information between inputs and outputs of a causal attention model - LCL performs two tasks: contrastive learning and text generation - LCL matches CLIP on paired datasets and learns robust visual representation from scratch on interleaved data Summary: LCL is a novel pre-training method that uses interleaved image-text data to learn robust visual representation by maximizing mutual information between inputs and outputs of a causal attention model, performing contrastive learning and text generation.
http://arxiv.org/abs/2406.07542v1
Compressor summary: The authors propose a multimodal model that uses audio and text features to predict cognitive decline and diagnose Mild Cognitive Impairment using bilingual interviews.
http://arxiv.org/abs/2406.07541v1
Compressor summary: Our method improves conservative offline reinforcement learning by adjusting actions using denoising scores from the dataset density gradient, enhancing generalization and risk aversion.
http://arxiv.org/abs/2406.07540v1
Compressor summary: Ctrl-X is a framework that enables fine-grained structure and appearance control in text-to-image and text-to-video models without additional training or guidance, improving image quality and transferability.
http://arxiv.org/abs/2406.07537v1
Compressor summary: The paper shows that autoregressive pretraining can improve Mamba's visual capabilities, speed up training, and achieve higher ImageNet accuracy than supervised training for large Mamba models.
http://arxiv.org/abs/2406.07536v1
Compressor summary: The paper proposes an efficient model selection scheme called isolated model embedding, which can quickly update and select models for tasks using a single sweep over vectors.
http://arxiv.org/abs/2406.07529v1
Compressor summary: The paper introduces a new model-merging method that considers trade-offs between tasks by finding a Pareto front of scaling coefficients using amortized inference and efficient computation techniques.
http://arxiv.org/abs/2406.07528v1
Compressor summary: Q-LLM is a system that enhances large language models' ability to understand long sequences and answer questions by focusing on relevant memory data.
http://arxiv.org/abs/2406.07524v1
Compressor summary: The paper proposes a simple masked discrete diffusion method that improves the performance of language models and achieves state-of-the-art results among diffusion models.
http://arxiv.org/abs/2406.07522v1
Compressor summary: Samba is a hybrid architecture that combines Mamba (a selective State Space Model) and Sliding Window Attention to efficiently model sequences with infinite context length, achieving state-of-the-art results and significant speedup compared to Transformers.
http://arxiv.org/abs/2406.07520v1
Compressor summary: Neural Gaffer is an end-to-end diffusion model that can accurately relight any image under new lighting conditions without decomposing the scene into its components.
http://arxiv.org/abs/2406.07516v1
Compressor summary: AvatarPopUp generates fast and high-quality 3D human avatars from images, text, and body controls using diffusion-based image synthesis networks and a 3D lifting network.
http://arxiv.org/abs/2406.07515v1
Compressor summary: The text discusses using feedback on generated data to prevent model collapse in fine-tuning large language models and provides theoretical and practical examples.
http://arxiv.org/abs/2406.07507v1
Compressor summary: Flow map matching is a fast and flexible generative model that learns the two-time flow map of an ODE, unifying existing few-step models and outperforming diffusion and stochastic interpolants in sample quality and efficiency.
http://arxiv.org/abs/2406.07506v1
Compressor summary: The study investigates how well large multimodal models can learn new visual concepts from single word embedding perturbations and finds that the models' performance depends on their specific embeddings, which are not transferable.
http://arxiv.org/abs/2406.07505v1
Compressor summary: The authors present a new method, Financial Analyst Extension, that improves the performance of large language models on financial analysis tasks and provide a new dataset, Flare CFA, to evaluate them.
http://arxiv.org/abs/2406.07504v1
Compressor summary: The study shows that voice cloning models can reduce or increase the perception of a speaker's "gay voice", affecting accessibility and raising ethical concerns.
http://arxiv.org/abs/2406.07502v1
Compressor summary: The paper introduces Image Textualization (IT), a framework that uses multi-modal language models and vision expert models to create detailed image descriptions, and proposes benchmarks for evaluating them.
http://arxiv.org/abs/2406.07500v1
Compressor summary: SPIN is an open-source image generation tool for realistic spacecraft navigation that improves visual-based algorithms' performance by providing diverse ground-truth data and customization options.
http://arxiv.org/abs/2406.07499v1
Compressor summary: Trim 3D Gaussian Splatting (TrimGS) removes inaccurate 3D geometry from images by selectively trimming 3D Gaussians and preserves accurate structures, leading to better reconstruction results.
http://arxiv.org/abs/2406.07496v1
Compressor summary: TextGrad is a framework that automates optimization of compound AI systems using textual feedback from large language models, achieving significant improvements in various applications such as question answering, molecule optimization, and radiotherapy treatment planning.
http://arxiv.org/abs/2406.07494v1
Compressor summary: Key points: - The article reviews Transformer-based abstractive summarization for English dialogues - It covers the main challenges and techniques in dialog summarization - It evaluates the datasets, metrics, and human assessment methods used in the field - It suggests that large language models may change the relevance and difficulty of the task Summary: The article systematically reviews the research on using Transformers to summarize English dialogues, highlighting the challenges, techniques, and evaluation methods in the field, and discussing the potential impact of large language models.
http://arxiv.org/abs/2406.07492v1
Compressor summary: The paper explores using automatic affirmative paraphrases to enhance language models' performance on tasks involving negation.
http://arxiv.org/abs/2406.07488v1
Compressor summary: ReduceFormer is a fast transformer model that uses simple operations instead of expensive ones like matrix multiplication and Softmax, making it suitable for low-latency and high-throughput applications.
http://arxiv.org/abs/2406.07487v1
Compressor summary: The paper proposes GLAD, an adaptive diffusion model for unsupervised anomaly detection that adjusts to different anomalies and retains normal information by introducing synthetic abnormal samples in training and using a spatial-adaptive feature fusion scheme during inference.
http://arxiv.org/abs/2406.07484v1
Compressor summary: The study shows that a Transformer model outperforms other deep learning models and a simple approach in predicting streamflow across 125 locations in Iowa using data from the past 72 hours.
http://arxiv.org/abs/2406.07483v1
Compressor summary: The paper evaluates eight LLMs' performance in annotating stance in social media posts, finding that their accuracy depends on the explicitness of the text and suggesting a hybrid approach with human expertise.
http://arxiv.org/abs/2406.07482v1
Compressor summary: The Bhutanese government uses technology to aid crop identification in Paro, and a study shows that deep learning approaches can effectively predict rice types and extents using satellite imagery.
http://arxiv.org/abs/2406.07480v1
Compressor summary: The paper proposes a method to train diffusion models on continuous images (image neural fields) instead of fixed-resolution ones, which improves their performance in various tasks.
http://arxiv.org/abs/2406.07476v1
Compressor summary: VideoLLaMA 2 is a video-oriented large language model that enhances spatial-temporal and audio understanding for various tasks and outperforms existing models.
http://arxiv.org/abs/2406.07475v1
Compressor summary: The paper proposes a new algorithm (PO-MFL) to infer trajectories from partially observed latent SDEs using entropic OT and shows its effectiveness in comparison to a latent-free method.
http://arxiv.org/abs/2406.07472v1
Compressor summary: The paper presents a method for generating realistic and dynamic 4D scenes using video generative models, without relying on pre-trained 3D models or synthetic data.
http://arxiv.org/abs/2406.07471v1
Compressor summary: OphNet is a large video dataset for understanding ophthalmic surgeries with detailed annotations for surgical phases, operations, and time-localized information.
http://arxiv.org/abs/2406.07466v1
Compressor summary: The paper introduces a multimodal approach to predicting speaker belief commitment using text, audio, and fusion methods, and reports results on the CBP corpus.
http://arxiv.org/abs/2406.07457v1
Compressor summary: The paper proposes a method to estimate the likelihood of incorrect predictions by conditional generative models in in-context learning tasks based on their response log probabilities.
http://arxiv.org/abs/2406.07456v1
Compressor summary: The Fractional Kolmogorov-Arnold Network (fKAN) is a new neural network architecture that combines the benefits of Kolmogorov-Arnold Networks with fractional Jacobi functions, leading to faster and more accurate learning across various tasks.
http://arxiv.org/abs/2406.07455v1
Compressor summary: The paper proposes a model-free reinforcement learning algorithm for human feedback, called $\mathsf{BSAD}$, which identifies the optimal policy directly from preference data using a dueling bandit sub-routine, and shows its sample complexity and generalization abilities.
http://arxiv.org/abs/2406.07451v1
Compressor summary: The paper proposes an online framework to evaluate and compare generative models using multi-armed bandit methods and FID/IS metrics to find the best model quickly and efficiently.
http://arxiv.org/abs/2406.07450v1
Compressor summary: The study benchmarks eight contrastive methods for multimodal medical representation learning, finding that general-domain representations transfer well, multimodal training is sufficient, and fine-grained features are beneficial.
http://arxiv.org/abs/2406.07444v1
Compressor summary: The text discusses document-level relation extraction (DocRE) models' lack of robustness to entity name variations, introduces two new benchmarks, and proposes a training method to improve robustness and reasoning capabilities.
http://arxiv.org/abs/2406.07440v1
Compressor summary: Textual similarity, a new quality estimation metric for machine translation, shows stronger correlations with human scores than traditional methods and may improve accuracy and usability in translation systems.
http://arxiv.org/abs/2406.07438v1
Compressor summary: DeformTime is a neural network architecture that uses deformable attention blocks to capture correlated temporal patterns and improve multivariate time series forecasting accuracy.
http://arxiv.org/abs/2406.07435v1
Compressor summary: BOA-Restormer is a robust image restoration model that uses alias-free paths in its transformer-based architecture, allowing it to preserve high-frequency information without sacrificing performance.
http://arxiv.org/abs/2406.07430v1
Compressor summary: The text introduces ConDA-TTA, a model that adapts to detecting out-of-context news on unlabeled domains using contrastive learning and test-time statistics, and shows its effectiveness on two datasets.
http://arxiv.org/abs/2406.07424v1
Compressor summary: The paper introduces MINERS, a benchmark to evaluate multilingual language models in semantic retrieval tasks across over 200 languages, showing their effectiveness in finding semantically similar words.
http://arxiv.org/abs/2406.07423v1
Compressor summary: The paper proposes a benchmark to evaluate sampling methods from intractable distributions using standardized tasks and performance criteria, and introduces new metrics for mode collapse.
http://arxiv.org/abs/2406.07418v1
Compressor summary: Our iterative gene panel selection strategy for single-cell genomics clustering uses reinforcement learning to optimize gene selection efficiently and adaptively, integrating preliminary boundaries from other algorithms and reducing biases.
http://arxiv.org/abs/2406.07413v1
Compressor summary: The paper presents a novel framework for incremental learning in graphs that improves memory diversity and selection to handle complex tasks, outperforming existing methods.
http://arxiv.org/abs/2406.07407v1
Compressor summary: The paper presents two efficient private algorithms for computing the geometric median of a dataset with an error guarantee based on its diameter, and a less efficient algorithm with pure differential privacy.
http://arxiv.org/abs/2406.07404v1
Compressor summary: The paper proposes a new feature transformation method that leverages a graph structure to preserve historical knowledge, improve exploration efficiency, and enable backtracking for adaptability.
http://arxiv.org/abs/2406.07402v1
Compressor summary: The text discusses the application of machine learning and deep learning to knowledge graphs using embeddings, focusing on random walk-based methods.
http://arxiv.org/abs/2406.07400v1
Compressor summary: The study explores how separating control and data when providing guidance to a Large Language Model (LLM) can improve the generation of specifications for reactive program synthesis using temporal logics.
http://arxiv.org/abs/2406.07399v1
Compressor summary: The study proposes a novel deep learning network for automotive radar imaging that improves spatial resolution and quality by harnessing radar signal processing domain knowledge, resulting in better performance than existing methods.
http://arxiv.org/abs/2406.07398v1
Compressor summary: The paper proposes a stochastic frame prediction model that learns uncertainty and temporal information for image representation learning, and shows its effectiveness on various video-based tasks.
http://arxiv.org/abs/2406.07394v1
Compressor summary: MCTSr is an algorithm that combines Large Language Models with Monte Carlo Tree Search to improve mathematical reasoning performance by exploring, refining, and evaluating decisions.
http://arxiv.org/abs/2406.07393v1
Compressor summary: This paper evaluates LLMs' out-of-context knowledge reasoning abilities using a synthetic dataset with seven tasks and finds that they struggle to retrieve relevant knowledge, both within and across languages.
http://arxiv.org/abs/2406.07381v1
Compressor summary: The paper proposes a new model-based reinforcement learning approach that uses large language models to generate hints for exploration in challenging long-horizon tasks with sparse rewards.
http://arxiv.org/abs/2406.07378v1
Compressor summary: The text discusses using large language models as an alternative to experts for creating causal graphs, with a proposed voting schema to improve their performance and some evidence of causal reasoning in the model's answers.
http://arxiv.org/abs/2406.07368v1
Compressor summary: The paper studies how to improve the efficiency and performance of autoregressive large language models using linear attention, speculative decoding, and an augmentation technique for linear attention that works with speculative decoding.
http://arxiv.org/abs/2406.07365v1
Compressor summary: The paper proposes a few-shot aspect sentiment quad prediction (ASQP) method using broadview soft prompting (BvSP) to improve adaptation in real applications.
http://arxiv.org/abs/2406.07361v1
Compressor summary: The authors propose a method that combines deep learning and optimization to improve image registration, achieving better performance, robustness, and flexibility in transformation representations.
http://arxiv.org/abs/2406.07359v1
Compressor summary: \sys is a summarization method for scholarly reviews that extracts common and unique opinions using Rational Speech Act framework-based uniqueness scores, helping area chairs discern reviewers' arguments more efficiently.
http://arxiv.org/abs/2406.07358v1
Compressor summary: The paper investigates how language models can strategically underperform on dangerous capability evaluations, which weakens their trustworthiness and affects AI safety decisions.
http://arxiv.org/abs/2406.07353v1
Compressor summary: The text surveys 119 papers on content-based computational analysis of toxic memes, introduces a new taxonomy for categorizing meme toxicity types, and identifies three dimensions of meme toxicity under automatic study.
http://arxiv.org/abs/2406.07348v1
Compressor summary: DR-RAG is a two-stage retrieval framework that uses document parts and query to mine relevance, improving answer accuracy and efficiency in knowledge-intensive tasks like multi-hop QA.
http://arxiv.org/abs/2406.07340v1
Compressor summary: The paper presents a formally verified, executable algorithm for approximate policy iteration on Factored Markov Decision Processes using Isabelle/HOL, with verified software for certifying Linear Programming solutions.
http://arxiv.org/abs/2406.07337v1
Compressor summary: AFT is a method to adaptively transfer useful features from large foundation models to small task-specific models for better performance with minimal overhead.
http://arxiv.org/abs/2406.07333v1
Compressor summary: The paper introduces a novel zero-shot texture anomaly detection method named GRNR, which uses intrinsic support priors to reconstruct query samples without any training data or cost.
http://arxiv.org/abs/2406.07330v1
Compressor summary: The paper explores how to improve non-autoregressive speech-to-speech translation using pretraining, knowledge distillation, and advanced training techniques, achieving comparable quality to autoregressive models with much faster decoding.
http://arxiv.org/abs/2406.07329v1
Compressor summary: Our method uses multi-view LDR images with varying exposure, aperture, and focus to reconstruct an HDR radiance field for flexible refocusing in real-time cinematic rendering.
http://arxiv.org/abs/2406.07327v1
Compressor summary: This paper revisits Direct Preference Optimization (DPO) for aligning large language models with human preferences, examines its limitations, and proposes regularization methods to improve it.
http://arxiv.org/abs/2406.07325v1
Compressor summary: The paper proposes a new parameterization method, called $\delta$-sampling, to adjust the behavior of deep reinforcement learning agents during solution construction for scheduling problems, leading to better exploration of the search space and improved solution quality.
http://arxiv.org/abs/2406.07320v1
Compressor summary: The paper presents a statistical framework for model evaluation in machine learning and computer vision using tailored sampling, estimation strategies, and k-means clustering, which leads to more precise and cost-effective accuracy estimates compared to traditional methods.
http://arxiv.org/abs/2406.07318v1
Compressor summary: The text introduces a new approach to process event data using graph convolutional networks, optimizing PointNet++ for low latency and energy consumption on FPGA-based systems.
http://arxiv.org/abs/2406.07314v1
Compressor summary: The paper proposes a robust graph neural network approach to handle noisy labels in graph classification tasks, improving both utility and privacy.
http://arxiv.org/abs/2406.07302v1
Compressor summary: BertaQA is a bilingual trivia dataset in English and Basque that tests large language models' knowledge of local and global culture, revealing the challenges and opportunities for cross-lingual knowledge transfer.