arxiv compressed, 2024-08-07

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-08-07 generated by the compressor, my personal LLM-based project.


LLaVA-OneVision: Easy Visual Task Transfer

http://arxiv.org/abs/2408.03326v1

Compressor summary: LLaVA-OneVision is a multimodal model that excels in single-image, multi-image, and video scenarios and enables strong transfer learning across different modalities/scenarios.


CoverBench: A Challenging Benchmark for Complex Claim Verification

http://arxiv.org/abs/2408.03325v1

Compressor summary: CoverBench is a benchmark for verifying language models' outputs in complex reasoning settings, built by standardizing diverse datasets that span multiple domains and types of reasoning.


Training LLMs to Recognize Hedges in Spontaneous Narratives

http://arxiv.org/abs/2408.03319v1

Compressor summary: The study analyzes hedges in Roadrunner cartoon narratives, comparing three LLM-based approaches for hedge detection and improving the gold standard coding.


Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

http://arxiv.org/abs/2408.03314v1

Compressor summary: This paper studies how using more test-time computation can improve LLMs' performance on open-ended natural language tasks and proposes a "compute-optimal" strategy to adaptively allocate inference-time compute per prompt, leading to significant efficiency improvements.
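The core idea, adaptive per-prompt allocation, can be illustrated with a toy best-of-N sketch. This is not the paper's method: the `difficulty` estimate and the majority-vote aggregator are stand-in assumptions, and `sample_fn` mocks a single LLM call.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]

def adaptive_best_of_n(sample_fn, prompts, difficulty, budget):
    """Toy compute-optimal allocation: split a fixed sampling budget
    across prompts in proportion to an (assumed) difficulty estimate,
    then aggregate each prompt's samples by majority vote.
    `sample_fn(prompt)` stands in for one LLM call."""
    total = sum(difficulty[p] for p in prompts)
    results = {}
    for p in prompts:
        # harder prompts get more samples (at least one each)
        n = max(1, round(budget * difficulty[p] / total))
        samples = [sample_fn(p) for _ in range(n)]
        results[p] = majority_vote(samples)
    return results
```

The point of the sketch is only the allocation step: under a fixed budget, easy prompts get one or two samples while hard prompts absorb the rest.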


MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

http://arxiv.org/abs/2408.03312v1

Compressor summary: The authors introduce a novel Masked Diffusion Transformer for co-speech gesture generation, which improves contextual reasoning and incorporates multi-modal information, achieving faster learning and inference speeds compared to traditional models.


Fusing Forces: Deep-Human-Guided Refinement of Segmentation Masks

http://arxiv.org/abs/2408.03304v1

Compressor summary: The authors propose a method to improve the automatic tracing of elaborate illustrations on Etruscan mirrors using a deep neural network that interactively refines existing annotations based on human guidance, achieving higher-quality annotations with less manual input and in less time.


TextIM: Part-aware Interactive Motion Synthesis from Text

http://arxiv.org/abs/2408.03302v1

Compressor summary: TextIM is a framework for generating realistic human interactive motions based on textual descriptions, focusing on aligning part-level semantics and achieving spatial coherence.


KaPO: Knowledge-aware Preference Optimization for Controllable Knowledge Selection in Retrieval-Augmented Language Models

http://arxiv.org/abs/2408.03297v1

Compressor summary: KaPO is a method to improve large language models' ability to handle knowledge conflicts and select relevant information in real retrieval scenarios using preference optimization.


DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers

http://arxiv.org/abs/2408.03291v1

Compressor summary: DopQ-ViT is a new method to compress vision transformers by using a distribution-friendly Tan Quantizer and a scaling factor compensation technique, which improves accuracy and efficiency in low-bit settings.
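For context, a minimal post-training quantizer can be sketched as below. This is the standard symmetric min-max baseline, not DopQ-ViT's Tan Quantizer; it only shows what the scale factor in such schemes does.

```python
import numpy as np

def uniform_quantize(x, num_bits=4):
    """Baseline symmetric uniform post-training quantization:
    map float values to num_bits signed integers via a scale factor,
    then dequantize. (DopQ-ViT replaces this with a distribution-
    friendly Tan Quantizer; this is only the textbook baseline.)"""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for 4-bit
    scale = np.abs(x).max() / qmax          # one scale per tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, scale
```

The rounding error of this baseline is bounded by half the scale factor, which is why a poorly chosen scale (e.g. inflated by outliers) hurts low-bit accuracy, the failure mode DopQ-ViT targets.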


SARA: Singular-Value Based Adaptive Low-Rank Adaption

http://arxiv.org/abs/2408.03290v1

Compressor summary: The paper proposes SARA and Mo-SARA, which are adaptive low-rank methods for fine-tuning pre-trained models that adjust the rank based on layer performance and reduce parameters by selectively updating singular values.
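The general idea behind singular-value-based adaptation can be sketched as follows: factor a frozen pretrained weight with SVD and expose only the singular values as trainable parameters. This is a hypothetical illustration of the family of methods, not SARA or Mo-SARA themselves.

```python
import numpy as np

def svd_adapter(W, rank):
    """Sketch of singular-value-based adaptation: factor a frozen
    pretrained weight W = U @ diag(s) @ Vt and keep only the top
    `rank` singular values as trainable parameters. Returns the
    frozen factors and the trainable vector (a tiny fraction of W)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank], Vt[:rank, :], s[:rank].copy()

def reconstruct(U, Vt, s):
    """Rebuild the adapted weight from frozen factors and trained s."""
    return (U * s) @ Vt
```

With this layout only `rank` scalars are updated per weight matrix, versus `m * n` for full fine-tuning, which is the parameter-efficiency argument such methods rely on.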


Malicious Internet Entity Detection Using Local Graph Inference

http://arxiv.org/abs/2408.03287v1

Compressor summary: Key points:
- The paper proposes a new method for detecting malicious behavior in large networks using graph data and neural networks
- The method achieves high expressivity and scalability by modeling network entity interactions as a heterogeneous graph and performing local graph inference
- The method outperforms the state-of-the-art PTP algorithm and generalizes well to new entities
Summary: The paper introduces a novel graph-based neural network method for detecting malicious behavior in large networks, which is more expressive, scalable, and robust than existing approaches.


ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

http://arxiv.org/abs/2408.03284v1

Compressor summary: The paper proposes ReSyncer, a framework that creates high-fidelity lip-synced videos with various features suitable for virtual presenters and performers.


AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval

http://arxiv.org/abs/2408.03282v1

Compressor summary: Key points:
- The paper proposes a transformer-based model for image retrieval re-ranking with low memory usage (1KB per image)
- The model can estimate asymmetric similarity between query and database images using local descriptors
- The model adapts to different applications and outperforms current methods in both performance and memory efficiency
Summary: The paper presents a transformer-based image retrieval re-ranking model that can estimate similarity with low memory, adapt to different applications, and achieve better performance than existing methods.


StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation

http://arxiv.org/abs/2408.03281v1

Compressor summary: StructEval is a new framework that evaluates large language models by conducting structured assessments across multiple cognitive levels and critical concepts, improving reliability and consistency in model evaluation.


Synthesizing Text-to-SQL Data from Weak and Strong LLMs

http://arxiv.org/abs/2408.03256v1

Compressor summary: The paper proposes a synthetic data method to improve text-to-SQL models' domain generalization and error supervision, leading to SENSE, an open-source model that outperforms existing methods on SPIDER and BIRD benchmarks.


Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons

http://arxiv.org/abs/2408.03247v1

Compressor summary: The paper examines how Large Language Models use their stored knowledge for reasoning tasks, finding that they sometimes rely on shortcuts instead of recalling facts accurately, and explores techniques to improve their reasoning performance.


Making Long-Context Language Models Better Multi-Hop Reasoners

http://arxiv.org/abs/2408.03246v1

Compressor summary: The paper proposes Reasoning with Attributions, a method to improve language models' ability to reason across multiple steps and handle noisy contexts by asking them to explain their assertions.


Analysis of Partially-Calibrated Sparse Subarrays for Direction Finding with Extended Degrees of Freedom

http://arxiv.org/abs/2408.03236v1

Compressor summary: The paper proposes a DOA estimation algorithm for partially-calibrated sparse subarrays, using coarray properties to estimate more sources than physical sensors and outperforming other methods.


Contrastive Learning for Image Complexity Representation

http://arxiv.org/abs/2408.03230v1

Compressor summary: The authors build on MoCo v2, a contrastive learning framework, to represent image complexity and enhance computer vision tasks without expensive manual annotations or human biases.


Don't Think It Twice: Exploit Shift Invariance for Efficient Online Streaming Inference of CNNs

http://arxiv.org/abs/2408.03223v1

Compressor summary: StreamiNNC optimizes computational efficiency for deep learning time-series processing with overlapping windows by exploiting shift-invariance and reducing zero-padding and pooling errors.


Learning to Learn without Forgetting using Attention

http://arxiv.org/abs/2408.03219v1

Compressor summary: The paper proposes a meta-learning transformer-based optimizer to enhance continual learning in machine learning models by selectively updating parameters and preventing unnecessary forgetting.


IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

http://arxiv.org/abs/2408.03209v1

Compressor summary: IPAdapter-Instruct is a method that uses natural-image conditioning and Instruct prompts to switch between different image generation tasks with one model, improving efficiency and quality.


A Debiased Nearest Neighbors Framework for Multi-Label Text Classification

http://arxiv.org/abs/2408.03202v1

Compressor summary: The paper introduces DENN, a framework for multi-label text classification that mitigates embedding alignment and confidence estimation biases in the $k$NN approach by proposing debiased contrastive learning and confidence estimation strategies.


RELIEF: Reinforcement Learning Empowered Graph Feature Prompt Tuning

http://arxiv.org/abs/2408.03195v1

Compressor summary: RELIEF is a method that uses reinforcement learning to strategically add feature prompts to certain graph nodes to improve task performance and data efficiency.


Efficient NeRF Optimization -- Not All Samples Remain Equally Hard

http://arxiv.org/abs/2408.03193v1

Compressor summary: The paper proposes an online hard sample mining technique for efficient NeRF training that reduces compute time, memory usage, and improves view-synthesis quality.


An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

http://arxiv.org/abs/2408.03178v1

Compressor summary: The "Object Images" approach creates realistic 3D models by representing complex shapes as 2D images, allowing for efficient image generation and PBR material support.


Leveraging Parameter Efficient Training Methods for Low Resource Text Classification: A Case Study in Marathi

http://arxiv.org/abs/2408.03172v1

Compressor summary: The paper explores Parameter Efficient Fine-Tuning methods for low-resource Marathi BERT models and shows they can achieve competitive results with less computational cost.


Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study

http://arxiv.org/abs/2408.03164v1

Compressor summary: DCLS improves interpretability by aligning visual models with human visual attention, and Threshold-Grad-CAM enhances interpretability further for some models.


Iterative CT Reconstruction via Latent Variable Optimization of Shallow Diffusion Models

http://arxiv.org/abs/2408.03156v1

Compressor summary: The study proposes a new CT reconstruction method that combines denoising diffusion models with iterative CT reconstruction, optimizing the fidelity loss and suppressing anatomical structure changes, resulting in high-quality images while preserving structures.


TSC: A Simple Two-Sided Constraint against Over-Smoothing

http://arxiv.org/abs/2408.03152v1

Compressor summary: The paper proposes a Two-Sided Constraint for Graph Convolutional Networks to address both causes of over-smoothing by using random masking and contrastive constraint, improving node discriminability and reducing convergence issues.


Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization

http://arxiv.org/abs/2408.03149v1

Compressor summary: EGMS is a multimodal summarization model that uses dual encoders and a gating mechanism to integrate fine-grained entity knowledge from images for enhanced textual summary generation.


SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection

http://arxiv.org/abs/2408.03143v1

Compressor summary: SuperSimpleNet is a discriminative model that detects surface defects on objects using normal or abnormal training images, improving performance and speed over previous methods.


Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

http://arxiv.org/abs/2408.03130v1

Compressor summary: This literature review explores different techniques for compressing large language models, such as quantization, pruning, knowledge distillation, and architectural optimizations, and categorizes them into a taxonomy.


Lisbon Computational Linguists at SemEval-2024 Task 2: Using A Mistral 7B Model and Data Augmentation

http://arxiv.org/abs/2408.03127v1

Compressor summary: The paper presents a method for classifying statements about clinical trial reports using Mistral-7B, an open-source large language model, with promising results and some limitations.


COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework

http://arxiv.org/abs/2408.03125v1

Compressor summary: COMMENTATOR is a tool that speeds up multilingual text annotation, especially for code-mixed languages like Hinglish.


Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile Baseline

http://arxiv.org/abs/2408.03120v1

Compressor summary: The paper introduces a new dataset with text descriptions for plant diseases, which can help recognize them better in real-world images and handle variability within and between classes.


Evaluating the Translation Performance of Large Language Models Based on Euas-20

http://arxiv.org/abs/2408.03119v1

Compressor summary: The paper introduces Euas-20, a dataset to evaluate large language models' translation abilities across different languages and the impact of pre-training data.


Topic Modeling with Fine-tuning LLMs and Bag of Sentences

http://arxiv.org/abs/2408.03099v1

Compressor summary: The paper proposes FT-Topic, an unsupervised fine-tuning method for LLMs that uses sentence groups to improve topic modeling performance, and introduces SenClu, a fast inference algorithm based on this approach.


500xCompressor: Generalized Prompt Compression for Large Language Models

http://arxiv.org/abs/2408.03094v1

Compressor summary: The 500xCompressor method compresses natural language contexts into one special token, improving inference speed, reducing costs, and enhancing user experience without requiring fine-tuning of the large language model.


Learning Provably Robust Policies in Uncertain Parametric Environments

http://arxiv.org/abs/2408.03093v1

Compressor summary: Key points:
- Data-driven approach for learning robust MDP policies across unknown environments
- Interval MDP model built from finite samples of trajectories
- Synthesize one policy that meets requirements and bound its risk
- Trade-off between performance and risk
- Exploit state space, graph structure, and parametric structure knowledge
Summary: The paper proposes a method to learn robust MDP policies using interval MDP models from finite samples of trajectories, synthesizing one policy that meets requirements and bounding its risk, while leveraging environment knowledge.


Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement

http://arxiv.org/abs/2408.03092v1

Compressor summary: The paper proposes WIDEN, a method to merge large language models with different parameter changes, improving their capabilities in various tasks.


Enhancing Complex Causality Extraction via Improved Subtask Interaction and Knowledge Fusion

http://arxiv.org/abs/2408.03079v1

Compressor summary: The paper proposes UniCE, a unified framework for event causality extraction that addresses complex causality, subtask interaction, and knowledge fusion, achieving state-of-the-art results and outperforming ChatGPT.


Towards an Analysis of Discourse and Interactional Pragmatic Reasoning Capabilities of Large Language Models

http://arxiv.org/abs/2408.03074v1

Compressor summary: The paper surveys how pragmatic abilities are tested in large language models, dividing them into discourse and interactional pragmatics, and discussing their analysis methods.


Probing structural constraints of negation in Pretrained Language Models

http://arxiv.org/abs/2408.03070v1

Compressor summary: This paper investigates how pretrained language models encode negation and its impact on neighboring words, finding that they capture negation scope and its effect on NPI licensing.


Analysis of Argument Structure Constructions in a Deep Recurrent Language Model

http://arxiv.org/abs/2408.03062v1

Compressor summary: This study trains an LSTM network on a custom dataset to explore ASC representation and processing in the brain, finding that the model can effectively differentiate between various construction types and may reflect linguistic processing in humans.


MGFs: Masked Gaussian Fields for Meshing Building based on Multi-View Images

http://arxiv.org/abs/2408.03060v1

Compressor summary: The paper proposes Masked Gaussian Fields (MGFs), which generate accurate building surface reconstructions from images using multi-level masks and innovative losses, improving accuracy and efficiency over traditional methods and other state-of-the-art solutions.


Targeted Visual Prompting for Medical Visual Question Answering

http://arxiv.org/abs/2408.03043v1

Compressor summary: The paper introduces targeted visual prompting to improve region-based question answering in medical images using multimodal large language models.


L3iTC at the FinLLM Challenge Task: Quantization for Financial Text Classification & Summarization

http://arxiv.org/abs/2408.03033v1

Compressor summary: The article discusses L3iTC's participation in a financial text challenge, where they fine-tuned large language models for classification and summarization using low GPU memory and 4-bit quantization.


Nighttime Pedestrian Detection Based on Fore-Background Contrast Learning

http://arxiv.org/abs/2408.03030v1

Compressor summary: The study proposes Fore-Background Contrast Attention (FBCA), which uses background information to improve pedestrian detection under low-light conditions, achieving state-of-the-art results on three datasets.


Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

http://arxiv.org/abs/2408.03029v1

Compressor summary: The text proposes a novel reward shaping method for reinforcement learning that uses success rates from historical experiences and Beta distributions to balance exploration and exploitation efficiently.


CKNN: Cleansed k-Nearest Neighbor for Unsupervised Video Anomaly Detection

http://arxiv.org/abs/2408.03014v1

Compressor summary: The paper proposes CKNN, a method for unsupervised video anomaly detection that filters out anomaly clusters in the training dataset, achieving better results than previous methods and comparable to those trained with anomaly-free data.
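The underlying kNN anomaly score is standard and can be sketched in a few lines: a sample is anomalous if it is far from its k-th nearest training feature. CKNN's actual contribution, cleansing anomaly clusters out of the training bank first, is deliberately omitted here.

```python
import numpy as np

def knn_anomaly_scores(train_feats, test_feats, k=3):
    """Standard kNN anomaly scoring: the score of a test feature is
    its Euclidean distance to the k-th nearest training feature.
    (CKNN's key step, filtering anomaly clusters out of train_feats
    beforehand, is not shown.)"""
    # pairwise distances, shape (n_test, n_train)
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k - 1]
```

This baseline makes the paper's motivation concrete: if anomalies contaminate `train_feats`, anomalous test samples find close neighbors and score low, which is exactly what cleansing is meant to prevent.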


Fact Finder -- Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs

http://arxiv.org/abs/2408.03010v1

Compressor summary: The authors propose a hybrid system that combines large language models with knowledge graphs to improve factual correctness and completeness in answering natural language queries, especially for medical applications.


Dual-path Collaborative Generation Network for Emotional Video Captioning

http://arxiv.org/abs/2408.03006v1

Compressor summary: The paper proposes a dual-path collaborative generation network that dynamically perceives and generates emotional captions for videos, balancing factual content and emotional cues.


DreamLCM: Towards High-Quality Text-to-3D Generation via Latent Consistency Model

http://arxiv.org/abs/2408.02993v1

Compressor summary: The paper proposes DreamLCM, a method to improve text-to-3D quality by incorporating a Latent Consistency Model, and introduces two strategies to enhance generation further.


A Differential Smoothness-based Compact-Dynamic Graph Convolutional Network for Spatiotemporal Signal Recovery

http://arxiv.org/abs/2408.02987v1

Compressor summary: The paper introduces a Compact-Dynamic Graph Convolutional Network (CDGCN) that uses a unified tensor graph convolution framework to simultaneously process spatial and temporal patterns for spatiotemporal signal recovery, achieving better results than existing models.


Diffusion Model Meets Non-Exemplar Class-Incremental Learning and Beyond

http://arxiv.org/abs/2408.02983v1

Compressor summary: Diffusion-based Feature Replay (DiffFR) is a simple and effective method for non-exemplar class-incremental learning that uses self-supervised learning and prototype calibration to reduce catastrophic forgetting and improve feature representation.


Empathy Level Alignment via Reinforcement Learning for Empathetic Response Generation

http://arxiv.org/abs/2408.02976v1

Compressor summary: The EmpRL framework uses reinforcement learning to generate empathetic responses in dialogue systems by training a T5 model with an empathy reward function based on three communication mechanisms.


Wave Interpolation Neural Operator: Interpolated Prediction of Electric Fields Across Untrained Wavelengths

http://arxiv.org/abs/2408.02971v1

Compressor summary: WINO is a fast and accurate surrogate solver that interpolates electric field predictions across a wide range of wavelengths using Fourier Group Convolution Shuffling and conditioning techniques.


EC-Guide: A Comprehensive E-Commerce Guide for Instruction Tuning and Quantization

http://arxiv.org/abs/2408.02970v1

Compressor summary: EC-Guide is a model-agnostic e-commerce guide that improves large language models' performance on various tasks through instruction tuning and quantization, achieving top ranks at the Amazon KDD Cup '24.


Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement

http://arxiv.org/abs/2408.02966v1

Compressor summary: The paper proposes a method to compress irregular point clouds using adaptive conditional probability modeling and a dual-layer architecture with implicit neural representation, achieving high rate-distortion performance, low complexity, and arbitrary-scale upsampling.


Data-Driven Stochastic Closure Modeling via Conditional Diffusion Model and Neural Operator

http://arxiv.org/abs/2408.02965v1

Compressor summary: The paper presents a new data-driven framework for building stochastic and non-local closure models for complex dynamical systems using a conditional diffusion model and a neural operator, improving their performance in real-world applications.


Accuracy and Consistency of LLMs in the Registered Dietitian Exam: The Impact of Prompt Engineering and Knowledge Retrieval

http://arxiv.org/abs/2408.02964v1

Compressor summary: The paper evaluates three large language models' accuracy and consistency in answering 1050 nutrition questions using different prompts and techniques, finding that GPT-4o with chain-of-thought self-consistency (CoT-SC) prompting performed best overall and Gemini 1.5 Pro with zero-shot (ZS) prompting had the highest consistency.
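CoT-SC prompting itself is a simple aggregation scheme: sample several reasoning chains for the same question and majority-vote the final answers. A minimal sketch, with `sample_chain` mocking one stochastic LLM call:

```python
from collections import Counter

def self_consistency(sample_chain, prompt, n=5):
    """Chain-of-thought self-consistency (CoT-SC): sample n reasoning
    chains for the same prompt and return the majority final answer.
    `sample_chain(prompt)` stands in for one stochastic LLM call that
    returns a (reasoning, answer) pair."""
    answers = [sample_chain(prompt)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Voting over independent chains filters out occasional reasoning slips, which is why CoT-SC typically beats a single chain on multiple-choice exams like the one studied here.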


Anytime Multi-Agent Path Finding with an Adaptive Delay-Based Heuristic

http://arxiv.org/abs/2408.02960v1

Compressor summary: ADDRESS is a new MAPF-LNS variant that uses restricted Thompson Sampling to adaptively select promising destroy heuristics for large-scale multi-agent path finding, achieving cost improvements of 50% or more.
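Thompson Sampling over a fixed set of heuristics is a classic bandit scheme and can be sketched briefly. This illustrates the general selection mechanism only; ADDRESS's "restricted" variant and its integration with MAPF-LNS are not reproduced, and the success/failure bookkeeping is an assumption.

```python
import random

def thompson_select(stats):
    """Thompson Sampling over destroy heuristics: keep per-heuristic
    (successes, failures) counts, draw one Beta sample per arm, and
    pick the arm with the highest draw. `stats` maps a heuristic name
    to its (successes, failures) pair."""
    draws = {h: random.betavariate(s + 1, f + 1) for h, (s, f) in stats.items()}
    return max(draws, key=draws.get)

def update(stats, heuristic, improved):
    """Credit the chosen heuristic with a success if the repair step
    improved the solution, otherwise with a failure."""
    s, f = stats[heuristic]
    stats[heuristic] = (s + 1, f) if improved else (s, f + 1)
```

Sampling from the posterior rather than picking the empirically best arm keeps some exploration alive, so a heuristic that becomes useful late in the search can still be rediscovered.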


Online Temporal Action Localization with Memory-Augmented Transformer

http://arxiv.org/abs/2408.02957v1

Compressor summary: MATR is a memory-augmented transformer that leverages long-term context for online temporal action localization by selectively preserving past segment features and predicting action start and end times.


WWW: Where, Which and Whatever Enhancing Interpretability in Multimodal Deepfake Detection

http://arxiv.org/abs/2408.02954v1

Compressor summary: FakeMix is a new benchmark for detecting manipulated segments in videos and audio, revealing limitations of existing deepfake detection models.


Are Female Carpenters like Blue Bananas? A Corpus Investigation of Occupation Gender Typicality

http://arxiv.org/abs/2408.02948v1

Compressor summary: The study asks whether occupations have typical genders the way bananas have typical colors, but finds that whether gender is mentioned tracks the proportion of women in an occupation rather than its gender typicality.


Self-Supervised Learning for Multi-Channel Neural Transducer

http://arxiv.org/abs/2408.02945v1

Compressor summary: The paper explores self-supervised learning with wav2vec 2.0 for multi-channel end-to-end ASR and finds feature-wise quantization to be the most effective pre-training method.


Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method

http://arxiv.org/abs/2408.02936v1

Compressor summary: The paper proposes a method to improve ensemble learning by using confidence tensors to integrate weak base learners more efficiently and enhancing generalization performance with a smooth convex objective function.


Doubly Stochastic Adaptive Neighbors Clustering via the Marcus Mapping

http://arxiv.org/abs/2408.02932v1

Compressor summary: The paper proposes ANCMM, an algorithm that learns sparse, doubly stochastic similarity graphs for clustering problems using Marcus mapping, which relates to optimal transport efficiency.
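The classical way to make a positive matrix doubly stochastic is the Sinkhorn-Knopp iteration; the paper's Marcus mapping is a different operator with the same goal, so the sketch below is only the textbook baseline for comparison.

```python
import numpy as np

def sinkhorn(A, n_iters=200):
    """Sinkhorn-Knopp iteration: alternately normalize the rows and
    columns of a positive matrix until it is (approximately) doubly
    stochastic, i.e. every row and column sums to 1."""
    P = A.astype(float).copy()
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)  # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P
```

A doubly stochastic similarity matrix can be read as a balanced soft assignment between points, which is the property clustering methods like ANCMM exploit.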


The Need for a Big World Simulator: A Scientific Challenge for Continual Learning

http://arxiv.org/abs/2408.02930v1

Compressor summary: The text discusses how the "small agent, big world" frame motivates the need for continual learning and proposes two desiderata for designing better synthetic environments to evaluate agents' performance.


HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection

http://arxiv.org/abs/2408.02927v1

Compressor summary: The paper introduces HARMONIC, a new framework that uses large language models (LLMs) to generate realistic and private tabular data by fine-tuning them with a k-nearest neighbors algorithm and evaluating their performance and privacy risks with specific metrics.


Intermediate direct preference optimization

http://arxiv.org/abs/2408.02923v1

Compressor summary: The paper introduces intermediate direct preference optimization (DPO), a new fine-tuning method for large language models that improves performance by computing DPO losses at multiple intermediate layers.


Data Checklist: On Unit-Testing Datasets with Usable Information

http://arxiv.org/abs/2408.02919v1

Compressor summary: The paper proposes a data checklist, a set of principled unit tests based on V-information literature, to systematically evaluate datasets for annotation artifacts and improve LLM alignment.


Leveraging Inter-Chunk Interactions for Enhanced Retrieval in Large Language Model-Based Question Answering

http://arxiv.org/abs/2408.02907v1

Compressor summary: IIER is a new framework for question-answering tasks that uses chunk interactions to enhance retrieval and improve performance.


Dual-View Pyramid Pooling in Deep Neural Networks for Improved Medical Image Classification and Confidence Calibration

http://arxiv.org/abs/2408.02906v1

Compressor summary: The paper proposes dual-view pyramid pooling (DVPP), a new method to aggregate features in deep neural networks for better medical image classification and confidence calibration by combining spatial pooling and cross-channel pooling.


Enabling Intelligent Traffic Systems: A Deep Learning Method for Accurate Arabic License Plate Recognition

http://arxiv.org/abs/2408.02904v1

Compressor summary: The paper presents a two-stage framework for accurate Egyptian license plate recognition using image processing and deep learning, with applications in traffic management.


Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

http://arxiv.org/abs/2408.02901v1

Compressor summary: Lighthouse is a user-friendly library for video moment retrieval and highlight detection, addressing the issues of lack of comprehensive experiments and user-unfriendly design in the research community.


SETN: Stock Embedding Enhanced with Textual and Network Information

http://arxiv.org/abs/2408.02899v1

Compressor summary: The paper proposes a method to represent stocks as vectors using both text and network data, and shows that it improves performance on related tasks and fund creation in wealth management.


A Metric Driven Approach to Mixed Precision Training

http://arxiv.org/abs/2408.02897v1

Compressor summary: The paper proposes a metric-driven approach to choosing low precision numerics for deep learning models, which can reduce hardware costs and enable scaling of training.


Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection

http://arxiv.org/abs/2408.02891v1

Compressor summary: Our new technique uses conditional diffusion models to create diverse and semantically consistent augmented images for better object detection.


VizECGNet: Visual ECG Image Network for Cardiovascular Diseases Classification with Multi-Modal Training and Knowledge Distillation

http://arxiv.org/abs/2408.02888v1

Compressor summary: VizECGNet is a multi-modal deep learning model that uses only printed ECG images to diagnose cardiovascular diseases, outperforming signal-based models with higher precision, recall, and F1-Score.


Body of Her: A Preliminary Study on End-to-End Humanoid Agent

http://arxiv.org/abs/2408.02879v1

Compressor summary: The authors propose a realistic humanoid agent that can communicate using speech and body movements and manipulate objects in real time, building on a large language model that integrates audio and visual inputs.