Audio-video summarization of TV news using speech recognition and shot change detection

Author(s):  
Chien-Lin Huang ◽  
Chia-Hsin Hsieh ◽  
Chung-Hsien Wu
2014 ◽  
Vol 44 (2) ◽  
pp. 175-184 ◽  
Author(s):  
Darryl Stewart ◽  
Rowan Seymour ◽  
Adrian Pass ◽  
Ji Ming

2012 ◽  
Vol 60 (2) ◽  
pp. 307-316 ◽  
Author(s):  
M. Kubanek ◽  
J. Bobulski ◽  
L. Adrjanowicz

Abstract. This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on combined hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicates that they would perform similarly in continuous audio-visual speech recognition. Visual speech analysis is a very difficult and computationally demanding problem, mostly because of the extreme amount of data that needs to be processed. Therefore, audio-video speech recognition is used only when the audio speech signal is exposed to a considerable level of distortion. This paper proposes the authors' own methods for lip-edge detection and visual feature extraction. Moreover, a method for fusing the audio and visual speech features was proposed and tested. A significant increase in recognition accuracy and processing speed was observed during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual features. The experimental results were very promising and close to those achieved by leading researchers in the field of audio-visual speech recognition.
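The design decision above, falling back to the computationally demanding audio-visual path only when the audio channel is heavily distorted, can be sketched as a simple gate. This is a minimal illustration: the `snr_db` helper, the recognizer callables, and the 10 dB threshold are our assumptions, not details taken from the paper.

```python
import numpy as np

def snr_db(signal, noise):
    # Hypothetical helper: rough signal-to-noise ratio in decibels.
    # The paper does not specify how audio distortion is measured.
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def recognize(audio, video, audio_recognizer, av_recognizer, snr, threshold_db=10.0):
    # Gate the expensive audio-visual (CHMM) path: use it only when the
    # audio channel is considerably distorted, as the abstract describes.
    # The 10 dB threshold is an illustrative assumption.
    if snr >= threshold_db:
        return audio_recognizer(audio)
    return av_recognizer(audio, video)
```

In practice the threshold would be tuned against the same distortion conditions used in the tests.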


Author(s):  
Hrishikesh Bhaumik ◽  
Siddhartha Bhattacharyya ◽  
Susanta Chakraborty

Over the past decade, research in the field of Content-Based Video Retrieval Systems (CBVRS) has attracted much attention, as it encompasses processing of all the other media types, i.e., text, image, and audio. Video summarization is one of its most important applications, as it potentially enables efficient and faster browsing of large video collections. A concise version of a video is often required due to constraints on viewing time, storage, communication bandwidth, and power. Thus, the task of video summarization is to effectively extract the most important portions of the video without sacrificing the semantic information in it. The results of video summarization can be used in many CBVRS applications, such as semantic indexing, video surveillance, copied-video detection, etc. However, the quality of the summarization task depends on two basic aspects: content coverage and redundancy removal. These two aspects are both important, yet they conflict with each other. This chapter aims to provide insight into the state-of-the-art approaches used in this booming field of research.


2021 ◽  
Vol 11 (11) ◽  
pp. 5260
Author(s):  
Theodoros Psallidas ◽  
Panagiotis Koromilas ◽  
Theodoros Giannakopoulos ◽  
Evaggelos Spyrou

The exponential growth of user-generated content has increased the need for efficient video summarization schemes. However, most approaches underestimate the power of aural features and are designed to work mainly on commercial/professional videos. In this work, we present an approach that uses both aural and visual features to create video summaries from user-generated videos. Our approach produces dynamic video summaries, that is, summaries comprising the most “important” parts of the original video, arranged so as to preserve their temporal order. We use supervised knowledge from both of the aforementioned modalities and train a binary classifier, which learns to recognize the important parts of videos. Moreover, we present a novel user-generated dataset which contains videos from several categories. Every one-second segment of each video in our dataset has been annotated by more than three annotators as being important or not. We evaluate our approach using several classification strategies based on audio, video, and fused features. Our experimental results illustrate the potential of our approach.
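The pipeline this abstract outlines, per-segment aural and visual features fused and fed to a binary "importance" classifier, could look roughly like the following sketch. The early-fusion concatenation and the nearest-centroid classifier are illustrative stand-ins for the unspecified feature representations and classification strategies, not the authors' actual method.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    # Early fusion: concatenate each segment's audio and visual vectors.
    # Rows are one-second segments; columns are feature dimensions.
    return np.concatenate([audio_feats, visual_feats], axis=1)

class CentroidClassifier:
    # Stand-in binary classifier: one centroid per class, predict by
    # nearest centroid ("important" = 1, "not important" = 0).
    def fit(self, X, y):
        self.c1 = X[y == 1].mean(axis=0)
        self.c0 = X[y == 0].mean(axis=0)
        return self

    def predict(self, X):
        d1 = np.linalg.norm(X - self.c1, axis=1)
        d0 = np.linalg.norm(X - self.c0, axis=1)
        return (d1 < d0).astype(int)

def important_segments(labels):
    # Keep the predicted-important segments in their original temporal
    # order, yielding a dynamic summary as the abstract describes.
    return [i for i, keep in enumerate(labels) if keep]
```

Any real implementation would replace the centroid rule with the paper's evaluated classifiers and real audio/visual descriptors.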


Entropy ◽  
2018 ◽  
Vol 20 (10) ◽  
pp. 748 ◽  
Author(s):  
Shazia Saqib ◽  
Syed Kazmi

Multimedia information requires large repositories of audio-video data. Retrieval and delivery of video content is a very time-consuming process and a great challenge for researchers. Video summarization offers an efficient approach for faster browsing of large video collections and more efficient content indexing and access. Compression of data through the extraction of keyframes is one solution to these challenges. A keyframe is a frame representative of the salient features of the video, and the output frames must represent the original video in temporal order. The proposed research presents a method of keyframe extraction using the mean of consecutive k frames of video data. A sliding window of size k/2 is employed to select the frame that matches the median entropy value of the sliding window. This is called the Median of Entropy of Mean Frames (MME) method: mean-based keyframe selection using the median entropy of the sliding window. The method was tested on more than 500 videos of sign language gestures and showed satisfactory results.
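One plausible reading of the MME procedure (mean frames over consecutive blocks of k frames, then picking, within a window of size k/2, the mean frame whose entropy is closest to the window's median) is sketched below. The grayscale-histogram entropy and the mapping of a selected mean frame back to an original frame index are our assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def frame_entropy(frame):
    # Shannon entropy of an 8-bit grayscale frame's intensity histogram.
    hist, _ = np.histogram(frame, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mme_keyframes(frames, k=10):
    # Mean frame for each consecutive block of k frames.
    n = len(frames) // k
    means = [np.mean(frames[i * k:(i + 1) * k], axis=0) for i in range(n)]
    entropies = [frame_entropy(m) for m in means]

    keyframes = []
    w = max(k // 2, 1)  # window of size k/2, as in the abstract
    for start in range(0, len(means), w):
        window = entropies[start:start + w]
        if not window:
            break
        med = np.median(window)
        # Pick the mean frame whose entropy is closest to the window median.
        idx = start + int(np.argmin([abs(e - med) for e in window]))
        # Map back to an original frame index (first frame of the block;
        # an illustrative choice, not specified by the abstract).
        keyframes.append(idx * k)
    return keyframes
```

The selected indices preserve temporal order by construction, matching the requirement that output frames represent the original video in temporal order.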

