Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

¹Gwangju Institute of Science and Technology, ²Seoul National University
Project Teaser

Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD) is a novel audio-visual highlight detection model that uses a dual-pathway audio encoder to capture both the semantic content and the spectro-temporal dynamics of sound.


Abstract

Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway applies a frequency-adaptive mechanism over time to jointly model spectral and temporal dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the TVSum and Mr.HiSum benchmarks. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
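
To make the idea behind the dynamic pathway more concrete, the following is a minimal illustrative sketch and not the DAViHD implementation. It assumes the audio is already available as a spectrogram (a list of frames, each a list of per-band energies), and it uses a plain softmax over bands as a stand-in for the learned frequency-adaptive mechanism. The function names `band_attention` and `dynamic_features` are hypothetical and chosen only for this example.

```python
import math

def band_attention(frame):
    """Softmax over frequency bands: more energetic bands get larger weights.
    (Assumption: a fixed softmax stands in for a learned mechanism.)"""
    peak = max(frame)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in frame]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_features(spectrogram):
    """Return one (salient_band, weighted_energy, energy_change) tuple per frame.

    spectrogram: list of frames, each a list of per-band energies.
    """
    features = []
    prev_energy = None
    for frame in spectrogram:
        weights = band_attention(frame)
        # Index of the band that receives the largest attention weight.
        salient_band = max(range(len(frame)), key=lambda i: weights[i])
        # Attention-weighted energy summary of the frame.
        weighted_energy = sum(w * v for w, v in zip(weights, frame))
        # Frame-to-frame change; a large positive value suggests a transient.
        change = 0.0 if prev_energy is None else weighted_energy - prev_energy
        features.append((salient_band, weighted_energy, change))
        prev_energy = weighted_energy
    return features

# Toy input (hypothetical values): three quiet frames, then a burst in band 2.
toy = [[0.1, 0.1, 0.1, 0.1]] * 3 + [[0.1, 0.1, 5.0, 0.1]]
feats = dynamic_features(toy)
burst_frame = max(range(len(feats)), key=lambda t: feats[t][2])
```

In this toy input the largest energy change falls on the last frame, and the salient band for that frame is band 2. A trained model would learn its band weights rather than derive them directly from energy; the sketch only shows how band saliency and rapid energy change together can flag a transient acoustic event.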


Interactive Demos 🔉

We compare our model with the baseline [1] on randomly selected YouTube videos. The results show that our model effectively captures both audio and visual cues for highlight detection. For each video, we plot the ground-truth highlight scores, taken from YouTube's 'Most Replayed' statistics, alongside the highlight scores predicted by each model.
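
As a numerical complement to reading the plotted curves by eye, a ground-truth curve and a predicted curve can be compared with a simple top-k overlap. The sketch below is only an illustration under assumptions: this page does not state which metric is used, the helper name `top_k_overlap` is hypothetical, and the score values are made up for the example.

```python
def top_k_overlap(ground_truth, predicted, k):
    """Fraction of the k highest-scoring ground-truth segments that the
    prediction also places among its own k highest-scoring segments."""
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top_k(ground_truth) & top_k(predicted)) / k

# Hypothetical per-segment scores for one video (not real data).
most_replayed = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]
model_scores = [0.2, 0.6, 0.7, 0.9, 0.3, 0.0]

overlap_top2 = top_k_overlap(most_replayed, model_scores, k=2)
overlap_top3 = top_k_overlap(most_replayed, model_scores, k=3)
```

With these made-up values, both curves agree on the two strongest segments, while only two of the top three segments match.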

1978 WORLD CUP FINAL: Argentina 3-1 Netherlands
YouTube Link

Killer whales are so clever | Frozen Planet II - BBC
YouTube Link

Huge wave shatters ferry window as Storm Ylenia batters Germany
YouTube Link

Wingsuit Flight - straight & steep line
YouTube Link


References

[1] Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, and Li Cheng, "Joint visual and audio learning for video highlight detection", in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8127-8137.