Audio-video summarization of TV news using speech recognition and shot change detection

Author(s):  
Chien-Lin Huang ◽  
Chia-Hsin Hsieh ◽  
Chung-Hsien Wu
2014 ◽  
Vol 44 (2) ◽  
pp. 175-184 ◽  
Author(s):  
Darryl Stewart ◽  
Rowan Seymour ◽  
Adrian Pass ◽  
Ji Ming

2012 ◽  
Vol 60 (2) ◽  
pp. 307-316 ◽  
Author(s):  
M. Kubanek ◽  
J. Bobulski ◽  
L. Adrjanowicz

Abstract. This paper focuses on combining audio-visual signals for Polish speech recognition under conditions of a highly disturbed audio speech signal. Recognition of audio-visual speech was based on combined hidden Markov models (CHMM). The described methods were developed for single isolated commands; nevertheless, their effectiveness indicates that they would perform similarly in continuous audio-visual speech recognition. Visual speech analysis is a very difficult and computationally demanding problem, mostly because of the extreme amount of data that needs to be processed. Therefore, audio-video speech recognition is used only when the audio speech signal is exposed to a considerable level of distortion. This paper proposes the authors' own methods for lip-edge detection and visual feature extraction. Moreover, a method for fusing the audio and visual speech features was proposed and tested. A significant increase in recognition accuracy and processing speed was observed during tests, given properly selected CHMM parameters, an adequate codebook size, and an appropriate fusion of the audio-visual features. The experimental results were very promising and close to those achieved by leading researchers in the field of audio-visual speech recognition.
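The design decision above, falling back to the computationally demanding audio-visual path only when the audio channel is heavily distorted, can be sketched as a simple gate. This is a minimal illustration: the `snr_db` helper, the recognizer callables, and the 10 dB threshold are our assumptions, not details taken from the paper.

```python
import numpy as np

def snr_db(signal, noise):
    # Hypothetical helper: rough signal-to-noise ratio in decibels.
    # The paper does not specify how audio distortion is measured.
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def recognize(audio, video, audio_recognizer, av_recognizer, snr, threshold_db=10.0):
    # Gate the expensive audio-visual (CHMM) path: use it only when the
    # audio channel is considerably distorted, as the abstract describes.
    # The 10 dB threshold is an illustrative assumption.
    if snr >= threshold_db:
        return audio_recognizer(audio)
    return av_recognizer(audio, video)
```

In practice the threshold would be tuned against the same distortion conditions used in the tests.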


Author(s):  
Hrishikesh Bhaumik ◽  
Siddhartha Bhattacharyya ◽  
Susanta Chakraborty

Over the past decade, research in the field of Content-Based Video Retrieval Systems (CBVRS) has attracted much attention, as it encompasses processing of all the other media types, i.e., text, image, and audio. Video summarization is one of its most important applications, as it potentially enables efficient and faster browsing of large video collections. A concise version of a video is often required due to constraints on viewing time, storage, communication bandwidth, and power. Thus, the task of video summarization is to effectively extract the most important portions of the video without sacrificing the semantic information in it. The results of video summarization can be used in many CBVRS applications, such as semantic indexing, video surveillance, copied-video detection, etc. However, the quality of the summarization task depends on two basic aspects: content coverage and redundancy removal. These two aspects are both important, yet they conflict with each other. This chapter aims to provide insight into the state-of-the-art approaches used in this booming field of research.


2021 ◽  
Vol 11 (11) ◽  
pp. 5260
Author(s):  
Theodoros Psallidas ◽  
Panagiotis Koromilas ◽  
Theodoros Giannakopoulos ◽  
Evaggelos Spyrou

The exponential growth of user-generated content has increased the need for efficient video summarization schemes. However, most approaches underestimate the power of aural features and are designed to work mainly on commercial/professional videos. In this work, we present an approach that uses both aural and visual features to create video summaries from user-generated videos. Our approach produces dynamic video summaries, that is, summaries comprising the most “important” parts of the original video, arranged so as to preserve their temporal order. We use supervised knowledge from both of the aforementioned modalities and train a binary classifier, which learns to recognize the important parts of videos. Moreover, we present a novel user-generated dataset which contains videos from several categories. Every one-second segment of each video in our dataset has been annotated by more than three annotators as being important or not. We evaluate our approach using several classification strategies based on audio, video, and fused features. Our experimental results illustrate the potential of our approach.
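The pipeline this abstract outlines, per-segment aural and visual features fused and fed to a binary "importance" classifier, could look roughly like the following sketch. The early-fusion concatenation and the nearest-centroid classifier are illustrative stand-ins for the unspecified feature representations and classification strategies, not the authors' actual method.

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    # Early fusion: concatenate each segment's audio and visual vectors.
    # Rows are one-second segments; columns are feature dimensions.
    return np.concatenate([audio_feats, visual_feats], axis=1)

class CentroidClassifier:
    # Stand-in binary classifier: one centroid per class, predict by
    # nearest centroid ("important" = 1, "not important" = 0).
    def fit(self, X, y):
        self.c1 = X[y == 1].mean(axis=0)
        self.c0 = X[y == 0].mean(axis=0)
        return self

    def predict(self, X):
        d1 = np.linalg.norm(X - self.c1, axis=1)
        d0 = np.linalg.norm(X - self.c0, axis=1)
        return (d1 < d0).astype(int)

def important_segments(labels):
    # Keep the predicted-important segments in their original temporal
    # order, yielding a dynamic summary as the abstract describes.
    return [i for i, keep in enumerate(labels) if keep]
```

Any real implementation would replace the centroid rule with the paper's evaluated classifiers and real audio/visual descriptors.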


Entropy ◽  
2018 ◽  
Vol 20 (10) ◽  
pp. 748 ◽  
Author(s):  
Shazia Saqib ◽  
Syed Kazmi

Multimedia information requires large repositories of audio-video data. Retrieval and delivery of video content is a very time-consuming process and a great challenge for researchers. Video summarization offers an efficient approach for faster browsing of large video collections and more efficient content indexing and access. Compression of data through the extraction of keyframes is one solution to these challenges. A keyframe is a frame representative of the salient features of the video, and the output frames must represent the original video in temporal order. The proposed research presents a method of keyframe extraction using the mean of consecutive k frames of video data. A sliding window of size k/2 is employed to select the frame that matches the median entropy value of the sliding window. This is called the Median of Entropy of Mean Frames (MME) method: mean-based keyframe selection using the median entropy of the sliding window. The method was tested on more than 500 videos of sign language gestures and showed satisfactory results.
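One plausible reading of the MME procedure (mean frames over consecutive blocks of k frames, then picking, within a window of size k/2, the mean frame whose entropy is closest to the window's median) is sketched below. The grayscale-histogram entropy and the mapping of a selected mean frame back to an original frame index are our assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def frame_entropy(frame):
    # Shannon entropy of an 8-bit grayscale frame's intensity histogram.
    hist, _ = np.histogram(frame, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mme_keyframes(frames, k=10):
    # Mean frame for each consecutive block of k frames.
    n = len(frames) // k
    means = [np.mean(frames[i * k:(i + 1) * k], axis=0) for i in range(n)]
    entropies = [frame_entropy(m) for m in means]

    keyframes = []
    w = max(k // 2, 1)  # window of size k/2, as in the abstract
    for start in range(0, len(means), w):
        window = entropies[start:start + w]
        if not window:
            break
        med = np.median(window)
        # Pick the mean frame whose entropy is closest to the window median.
        idx = start + int(np.argmin([abs(e - med) for e in window]))
        # Map back to an original frame index (first frame of the block;
        # an illustrative choice, not specified by the abstract).
        keyframes.append(idx * k)
    return keyframes
```

The selected indices preserve temporal order by construction, matching the requirement that output frames represent the original video in temporal order.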

