AI Speech Daily - Thursday, February 26, 2026

TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition

cs.SD 👤 Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen

Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.

EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs

cs.SD 👤 Wenjie Tian, Zhixian Zhao, Jingbin Hu

The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response.

A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation

eess.AS 👤 Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li

We propose a knowledge-driven, model-based approach to segmenting audio into single-category and mixed-category chunks with applications to source separation.

iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis

eess.AS 👤 Sofoklis Kakouros, Fang Kang, Haoyu Chen

This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states.

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

cs.SD 👤 Haoyang Li, Changsong Liu, Wei Rao

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm...

Geometric Analysis of Speech Representation Spaces: Topological Disentanglement and Confound Detection

cs.SD 👤 Bipasha Kashyap, Pubudu N. Pathirana

Speech-based clinical tools are increasingly deployed in multilingual settings, yet whether pathological speech markers remain geometrically separable from accent variation remains unclear.

MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

cs.SD 👤 Fang-Duo Tsai, Yi-An Lai, Fei-Yueh Chen

Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability.

Quantifying Dimensional Independence in Speech: An Information-Theoretic Framework for Disentangled Representation Learning

cs.SD 👤 Bipasha Kashyap, Björn W. Schuller, Pubudu N. Pathirana

Speech signals encode emotional, linguistic, and pathological information within a shared acoustic channel; however, disentanglement is typically assessed indirectly through downstream task...

Memory-guided Prototypical Co-occurrence Learning for Mixed Emotion Recognition

cs.SD 👤 Ming Li, Yong-Jin Liu, Fang Liu

Emotion recognition from multi-modal physiological and behavioral signals plays a pivotal role in affective computing, yet most existing models remain constrained to the prediction of singular...

UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

cs.SD 👤 Yuxuan Chen, Peize He, Haoyuan Xu

A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder.

Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization

cs.SD 👤 MD. Sagor Chowdhury, Adiba Fairooz Chowdhury

We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle.

823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio

cs.SD 👤 Ratnajit Dhar, Arpita Mallik

Bengali, despite being one of the most widely spoken languages globally, remains underrepresented in long form speech technology, particularly in systems addressing transcription and speaker...

Assessing the Impact of Speaker Identity in Speech Spoofing Detection

cs.SD 👤 Anh-Tuan Dao, Driss Matrouf, Nicholas Evans

Spoofing detection systems are typically trained using diverse recordings from multiple speakers, often assuming that the resulting embeddings are independent of speaker identity.

Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

cs.SD 👤 Darvan Shvan Khairaldeen, Hossein Hassani

Maqam, a type of singing, is a significant component of Kurdish music. A maqam singer is trained either in a traditional face-to-face setting or through self-training.
