A new front-end for classification of non-speech sounds: a study on human whistle

Author(s):  
Mahesh Kumar Nandwana ◽  
Hynek Bořil ◽  
John H. L. Hansen

Author(s):  
Gustavo Assunção ◽  
Paulo Menezes ◽  
Fernando Perdigão

The idea of recognizing human emotion through speech, known as speech emotion recognition (SER), has recently received considerable attention from the research community, mostly due to the current machine learning trend. Nevertheless, even the most successful methods still lack adaptation to specific speakers and scenarios, which evidently reduces their performance relative to humans. In this paper, we evaluate a large-scale machine learning model for the classification of emotional states. This model was trained for speaker identification but is instead used here as a front-end for extracting robust features from emotional speech. We aim to verify that SER improves when a speaker's emotional prosody cues are considered. Experiments with various state-of-the-art classifiers, run in the Weka software, are carried out to evaluate the robustness of the extracted features. Considerable improvement is observed when comparing our results with other state-of-the-art SER techniques.
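The approach can be illustrated with a short sketch (an assumption-laden illustration, not the authors' exact pipeline): a pretrained speaker-identification network serves as a fixed front-end, and a conventional classifier is trained on the resulting embeddings. The embedding extractor below is a hypothetical stub, the data are random stand-ins, and scikit-learn's SVM substitutes for the Weka classifiers used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def extract_speaker_embedding(waveform: np.ndarray) -> np.ndarray:
    # Hypothetical stub: in practice this would evaluate the penultimate
    # layer of a network trained for speaker identification.
    return rng.standard_normal(512)

# Stand-in corpus: 100 utterances (1 s at 16 kHz) with 4 emotion labels.
waveforms = [rng.standard_normal(16000) for _ in range(100)]
labels = rng.integers(0, 4, size=100)

# Front-end: one fixed speaker embedding per utterance.
X = np.stack([extract_speaker_embedding(w) for w in waveforms])

# Back-end: a conventional classifier trained on the embeddings,
# scored by 5-fold cross-validation.
scores = cross_val_score(SVC(kernel="rbf"), X, labels, cv=5)
print(f"cross-validated accuracy: {scores.mean():.3f}")
```

As the abstract describes, the speaker-ID model acts only as a feature extractor; just the back-end classifier is trained on emotion labels, so the prosody cues captured by the speaker model carry over to the SER task.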


1996 ◽  
Vol 49 (2) ◽  
pp. 295-314 ◽  
Author(s):  
Ruth Campbell ◽  
Barbara Brooks ◽  
Edward de Haan ◽  
Tony Roberts

The separability of different subcomponents of face processing has been regularly affirmed, but not always so clearly demonstrated. In particular, the ability to extract speech from faces (lip-reading) has been shown to dissociate doubly from face identification in neurological but not in other populations. In this series of experiments with undergraduates, the classification of speech sounds (lip-reading) from personally familiar and unfamiliar face photographs was explored using speeded manual responses. The independence of lip-reading from identity-based processing was confirmed. Furthermore, the established pattern of independence of expression-matching from, and dependence of identity-matching on, face familiarity was extended to personally familiar faces and “difficult”-emotion decisions. The implications of these findings are discussed.


1990 ◽  
Vol 4 (3) ◽  
pp. 247-252 ◽  
Author(s):  
Bozydar L. J. Kaczmarek

1968 ◽  
Vol 44 (1) ◽  
pp. 366-366
Author(s):  
William J. Beninghof ◽  
Myron Jay Ross

2021 ◽  
Author(s):  
Lam Pham ◽  
Hieu Tang ◽  
Anahid Jalal ◽  
Alexander Schindler ◽  
Ross King

In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use Mel filter, Gammatone filter, and Constant-Q Transform (CQT) front-ends to transform the raw audio signal into spectrograms, in which both frequency and temporal features are represented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), each classifying into ten urban scenes. Finally, a late fusion of the three predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, conducted on the DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1A development dataset, achieve a low-complexity CNN-based framework with 128 KB of trainable parameters and a best classification accuracy of 66.7%, improving on the DCASE baseline by 19.0%.
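As a minimal sketch of this three-branch design, the code below computes Mel and CQT spectrograms with librosa (the Gammatone branch needs a separate filter-bank implementation and is only noted in a comment) and stands in for the three trained back-end CNNs with random 10-class posteriors before applying late fusion. Everything except the librosa calls is a placeholder, not the paper's actual models.

```python
import numpy as np
import librosa

def mel_spectrogram(y, sr):
    # Log-Mel front-end: frequency-by-time representation of the clip.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(S)

def cqt_spectrogram(y, sr):
    # Constant-Q Transform front-end.
    C = np.abs(librosa.cqt(y, sr=sr))
    return librosa.amplitude_to_db(C)

def late_fusion(prob_vectors):
    # Unweighted mean of per-branch class posteriors; argmax gives the scene.
    return np.mean(np.stack(prob_vectors), axis=0)

sr = 32000
y = np.random.randn(10 * sr).astype(np.float32)  # stand-in for a 10 s scene clip

specs = [mel_spectrogram(y, sr), cqt_spectrogram(y, sr)]
# The Gammatone spectrogram would form the third branch; in the paper each
# of the three spectrograms feeds its own back-end CNN. Here the CNNs'
# 10-class posteriors are faked with random distributions.
rng = np.random.default_rng(0)
branch_probs = [rng.dirichlet(np.ones(10)) for _ in range(3)]

fused = late_fusion(branch_probs)
print("predicted scene index:", int(np.argmax(fused)))
```

Averaging posteriors is one common late-fusion rule; the abstract does not specify the exact fusion scheme, so the unweighted mean here is an assumption.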


1981 ◽  
Vol 52 (3) ◽  
pp. 1003-1006 ◽  
Author(s):  
W. G. Snow ◽  
S. Sheese

This study attempted to cross-validate the classification accuracy of Golden and Anderson's (1977) abbreviated version of the Halstead Speech Sounds Perception Test. A relatively high correlation was obtained between their short form and the standard long form for a sample of 150 patients aged 15 to 70 yr. However, the long form was more accurate in discriminating between 31 brain-damaged and 31 non-brain-damaged patients. These results suggest that use of the short form of this test may reduce classification accuracy.

