Multitask Learning with Local Attention for Tibetan Speech Recognition

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Hui Wang ◽  
Fei Gao ◽  
Yue Zhao ◽  
Li Yang ◽  
Jianjian Yue ◽  
...  

In this paper, we propose to incorporate local attention into WaveNet-CTC to improve the performance of Tibetan speech recognition in multitask learning. As the number of tasks increases, such as simultaneous Tibetan speech content recognition, dialect identification, and speaker recognition, the speech recognition accuracy of a single WaveNet-CTC model decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of feature frames within a window and pay different degrees of attention to context information for multitask learning. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning, compared with the baseline model. Furthermore, our method significantly improves accuracy for the low-resource dialect, by 5.11% over the dialect-specific model.
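The core idea, weighting feature frames within a local window and summing them into a context vector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dot-product score used here is a stand-in for the learned scoring network, and all names are hypothetical.

```python
import math

def local_attention(frames, center, window=5):
    """Softmax-weight the feature frames in a local window around
    `center` and return (weights, context vector)."""
    lo = max(0, center - window // 2)
    hi = min(len(frames), center + window // 2 + 1)
    query = frames[center]
    # score each frame in the window against the center frame
    scores = [sum(q * f for q, f in zip(query, frames[t])) for t in range(lo, hi)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]          # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    context = [sum(w * frames[lo + i][d] for i, w in enumerate(weights))
               for d in range(dim)]
    return weights, context
```

The weights always sum to one, so the context vector is a convex combination of the windowed frames; in the multitask setting, each task can learn its own scoring function over the same window.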

Over the years, many efforts have been made to improve recognition accuracy in automatic speech recognition (ASR) and speaker recognition (SRE), and many different technologies have been developed. Given the close relationship between these two tasks, researchers have proposed various ways to transfer techniques developed for one task to the other. In this paper, an open-source experimental framework for speech and speaker recognition is proposed. A unified model, Nexus-DNN, is then developed and trained jointly for speech and speaker recognition. Experimental results show that the combined model can effectively perform both ASR and SRE tasks.


2003 ◽  
Vol 14 (06) ◽  
pp. 983-994 ◽  
Author(s):  
CYRIL ALLAUZEN ◽  
MEHRYAR MOHRI

Finitely subsequential transducers are efficient finite-state transducers with a finite number of final outputs and are used in a variety of applications. However, not all transducers admit equivalent finitely subsequential transducers. We briefly describe an existing generalized determinization algorithm for finitely subsequential transducers and give the first characterization of finitely subsequentiable transducers, that is, transducers that admit equivalent finitely subsequential transducers. Our characterization shows the existence of an efficient algorithm for testing finite subsequentiability. We have fully implemented the generalized determinization algorithm and the algorithm for testing finite subsequentiability. We report experimental results showing that these algorithms are practical in large-vocabulary speech recognition applications. The theoretical formulation of our results is the equivalence of the following three properties for finite-state transducers: determinizability in the sense of the generalized algorithm, finite subsequentiability, and the twins property.
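To make the object of study concrete, a finitely subsequential transducer reads input deterministically, concatenating output strings along the way, and on acceptance appends one of a finite set of final outputs attached to the stopping state. A toy sketch (all names and the example machine are illustrative, not from the paper):

```python
class FinitelySubsequentialTransducer:
    """Deterministic transitions emit output strings; each accepting
    state carries a finite list of final outputs, so a single input
    maps to finitely many results."""

    def __init__(self, start, transitions, final_outputs):
        self.start = start
        self.transitions = transitions      # (state, symbol) -> (next_state, emitted)
        self.final_outputs = final_outputs  # state -> list of final output strings

    def apply(self, word):
        state, out = self.start, ""
        for sym in word:
            state, emit = self.transitions[(state, sym)]
            out += emit
        # append each final output of the stopping state
        return [out + f for f in self.final_outputs.get(state, [])]

# example machine: reads "ab", emits "xy", with final outputs "" and "z"
t = FinitelySubsequentialTransducer(
    start=0,
    transitions={(0, "a"): (1, "x"), (1, "b"): (0, "y")},
    final_outputs={0: ["", "z"]},
)
```

A (plain) subsequential transducer is the special case where every accepting state carries exactly one final output; the paper's question is when a nondeterministic transducer can be determinized into this finitely-many-outputs form.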


2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Zhijun Wang

Because the artistry of a work cannot be described precisely, identifying reproduced (copied) material is difficult, and doing so for digital image works requires in-depth study of their artistic character. In this paper, a remote judgment method for plagiarism of painting image style based on wireless-network multitask learning is proposed. Under this method, the uncertainty of painting image samples is removed through edge sampling with a multitask learning algorithm. Deep-level details of the painting image are extracted through the multitask classification kernel function, and most of the pixels in the image are eliminated. When the clustering density exceeds the judgment threshold, the two images are considered spatially consistent, and on this basis they are judged similar, that is, the painting is judged to be plagiarized. The experimental results show that the discrimination rate remains close to 100%, the misjudgment rate for painting image plagiarism is reduced, and all indicators in the discrimination process are the lowest, showing that highly satisfactory discrimination results can be obtained.
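The spatial-consistency test can be illustrated with a simple sketch: matched feature points from a copied region share a nearly constant offset, so binning the offsets and checking whether the densest bin exceeds a threshold is one plausible stand-in for the paper's clustering-density judgment. Names, the grid-binning choice, and the threshold are all assumptions for illustration.

```python
def spatially_consistent(matches, cell=10, threshold=0.5):
    """Judge spatial consistency from matched keypoint pairs.

    `matches` is a list of ((x1, y1), (x2, y2)) pairs. Offsets between
    matched points are binned into grid cells of size `cell`; the pair
    of images is deemed consistent when the densest cell holds at least
    `threshold` of all matches."""
    if not matches:
        return False
    bins = {}
    for (x1, y1), (x2, y2) in matches:
        key = ((x2 - x1) // cell, (y2 - y1) // cell)
        bins[key] = bins.get(key, 0) + 1
    return max(bins.values()) / len(matches) >= threshold
```

A copied region translated by a fixed amount produces one dominant offset cell; unrelated images scatter their matches across many cells and fall below the threshold.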


2020 ◽  
Author(s):  
chaofeng lan ◽  
yuanyuan Zhang ◽  
hongyun Zhao

Abstract This paper draws on the training method of the Recurrent Neural Network (RNN). By increasing the number of hidden layers of the RNN, changing the activation function of the input layer from the traditional Sigmoid to Leaky ReLU, and zero-padding the first and last groups of data to improve the effective utilization of the data, an improved Denoising Recurrent Neural Network (DRNN) model with high calculation speed and good convergence is constructed to address the problem of low speaker recognition rates in noisy environments. With this model, random semantic speech signals from the speech library, sampled at 16 kHz with a duration of 5 seconds, are studied. The experimental signal-to-noise ratios are set to -10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB. In the noisy environment, the improved model is used to denoise the Mel Frequency Cepstral Coefficients (MFCC) and the Gammatone Frequency Cepstral Coefficients (GFCC), and the impact of the traditional model and the improved model on the speech recognition rate is analyzed. The research shows that the improved model can effectively remove noise from the feature parameters and improve the speech recognition rate, and the improvement is more pronounced when the signal-to-noise ratio is low. When the signal-to-noise ratio is 0 dB, the speaker recognition rate is increased by 40%, reaching 85%, compared with the traditional speech model. As the signal-to-noise ratio increases, the recognition rate gradually rises; at 15 dB, the speaker recognition rate is 93%.
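The two ingredients named above, Leaky ReLU activation and zero-padding at both ends of the sequence, can be sketched in a single-unit recurrence. This is a toy illustration only: the scalar weights stand in for the trained weight matrices, and the function names are hypothetical.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: passes positives through, scales negatives by alpha
    so gradients do not vanish on the negative side as with Sigmoid."""
    return x if x > 0 else alpha * x

def drnn_step(frames, w_in=0.5, w_rec=0.3):
    """One-unit sketch of the denoising recurrence: the frame sequence
    is zero-padded at both ends (as the paper describes), and each
    hidden state mixes the current input with the previous state
    through Leaky ReLU."""
    padded = [0.0] + list(frames) + [0.0]
    h, out = 0.0, []
    for x in padded:
        h = leaky_relu(w_in * x + w_rec * h)
        out.append(h)
    return out
```

The padding gives the first and last real frames a full left/right context instead of truncating the recurrence at the sequence boundary.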


2020 ◽  
Vol 9 (1) ◽  
pp. 1022-1027

Driving a vehicle has become a tedious job nowadays due to heavy traffic, so keeping the driver's focus on the road is of utmost importance. This creates scope for automation in automobiles to minimize human intervention in controlling dashboard functions such as headlamps, indicators, power windows, and the wiper system. As a small effort in that direction, this paper proposes a voice-controlled dashboard to make driving distraction-free. The proposed system works on speech commands from the user (driver or passenger). Since the speech recognition system acts as the human-machine interface (HMI), the system uses both speaker recognition and speech recognition: it recognizes the command and verifies that the command comes from an authenticated user (driver or passenger). The system performs feature extraction, extracting speech features such as Mel Frequency Cepstral Coefficients (MFCC), Power Spectral Density (PSD), pitch, and the spectrogram. For feature matching, the system uses the Vector Quantization Linde-Buzo-Gray (VQ-LBG) algorithm, which uses the Euclidean distance between a test feature and a codebook feature. Based on the recognized speech command, the controller (Raspberry Pi 3B) activates the device driver for a motor or solenoid valve, depending on the function. The system is mainly intended for low-noise environments, as most speech recognition systems suffer when noise is introduced; room acoustics also matter greatly, since the recognition rate differs with acoustics. Over several testing and simulation trials, the system achieved a speech recognition rate of 76.13%. This system encourages automation of the vehicle dashboard and hence makes driving distraction-free.
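The VQ-LBG matching step can be sketched as follows: LBG grows a codebook by repeatedly splitting centroids and refining each split with nearest-neighbour reassignment, and matching scores a test utterance by its average Euclidean distance to the nearest codeword. This is a minimal sketch under assumed parameters (split factor, iteration count), not the paper's implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lbg_codebook(vectors, size, eps=0.01, iters=10):
    """Linde-Buzo-Gray: start from the global centroid, double the
    codebook by perturbed splitting, and refine with k-means passes."""
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    book = [centroid]
    while len(book) < size:
        # split every codeword into a (1+eps) / (1-eps) pair
        book = [c for cw in book for c in
                ([x * (1 + eps) for x in cw], [x * (1 - eps) for x in cw])]
        for _ in range(iters):
            clusters = [[] for _ in book]
            for v in vectors:
                i = min(range(len(book)), key=lambda j: euclidean(v, book[j]))
                clusters[i].append(v)
            book = [[sum(v[d] for v in cl) / len(cl) for d in range(dim)]
                    if cl else book[i] for i, cl in enumerate(clusters)]
    return book

def match_score(features, book):
    """Average distance from test feature frames to the nearest codeword;
    the enrolled speaker whose codebook minimizes this score wins."""
    return sum(min(euclidean(f, c) for c in book) for f in features) / len(features)
```

In a speaker-verification setup, one codebook is trained per enrolled user from their MFCC frames; a command is accepted when its match score against the claimed user's codebook falls below a decision threshold.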


2003 ◽  
Vol 127 (6) ◽  
pp. 721-725
Author(s):  
Maamoun M. Al-Aynati ◽  
Katherine A. Chorneyko

Abstract Context.—Software that can convert spoken words into written text has been available since the early 1980s. Early continuous speech systems were developed in 1994, with the latest commercially available editions having a claimed accuracy of up to 98% of speech recognition at natural speech rates. Objectives.—To evaluate the efficacy of one commercially available voice-recognition software system with pathology vocabulary in generating pathology reports and to compare this with human transcription. To draw cost analysis conclusions regarding human versus computer-based transcription. Design.—Two hundred six routine pathology reports from the surgical pathology material handled at St Joseph's Healthcare, Hamilton, Ontario, were generated simultaneously using computer-based transcription and human transcription. The following hardware and software were used: a desktop 450-MHz Intel Pentium III processor with 192 MB of RAM, a speech-quality sound card (Sound Blaster), noise-canceling headset microphone, and IBM ViaVoice Pro version 8 with pathology vocabulary support (Voice Automated, Huntington Beach, Calif). The cost of the hardware and software used was approximately Can $2250. Results.—A total of 23 458 words were transcribed using both methods with a mean of 114 words per report. The mean accuracy rate was 93.6% (range, 87.4%–96%) using the computer software, compared to a mean accuracy of 99.6% (range, 99.4%–99.8%) for human transcription (P < .001). Time needed to edit documents by the primary evaluator (M.A.) using the computer was on average twice that needed for editing the documents produced by human transcriptionists (range, 1.4–3.5 times). The extra time needed to edit documents was 67 minutes per week (13 minutes per day). Conclusions.—Computer-based continuous speech-recognition systems in pathology can be successfully used in pathology practice even during the handling of gross pathology specimens. 
The relatively low accuracy rate of this voice-recognition software with resultant increased editing burden on pathologists may not encourage its application on a wide scale in pathology departments with sufficient human transcription services, despite significant potential financial savings. However, computer-based transcription represents an attractive and relatively inexpensive alternative to human transcription in departments where there is a shortage of transcription services, and will no doubt become more commonly used in pathology departments in the future.


2020 ◽  
Vol 30 (01) ◽  
pp. 2050003
Author(s):  
Wenjie Peng ◽  
Kaiqi Fu ◽  
Wei Zhang ◽  
Yanlu Xie ◽  
Jinsong Zhang

Pitch-range estimation from brief speech segments could benefit many tasks, such as automatic speech recognition and speaker recognition. To estimate pitch range, previous studies have proposed deep-learning-based models that take spectrum information as input, and demonstrated that such a method works and can still achieve reliable estimates when the speech segment is as brief as 300 ms. In this study, we evaluated the robustness of this method, taking the following scenarios into account: (1) a large number of training speakers; (2) different language backgrounds; and (3) monosyllabic utterances with different tones. Experimental results showed that: (1) using a large number of training speakers improved estimation accuracy; (2) the mean absolute percentage error (MAPE) evaluated on L2 speakers is similar to that on native speakers; and (3) different tonal information affects the LSTM-based model, but this influence is limited compared with the baseline method, which calculates pitch-range targets from the distribution of F0 values. These results verify the effectiveness of the LSTM-based pitch-range estimation method.
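The evaluation metric named above, mean absolute percentage error, has a standard definition that is easy to state in code; the function name and the per-bound averaging are illustrative assumptions, since the paper does not specify its exact aggregation.

```python
def mape(targets, preds):
    """Mean absolute percentage error (in %): average of
    |target - prediction| / |target| over all estimates, e.g. over the
    upper and lower pitch-range bounds of each test segment."""
    return 100.0 * sum(abs(t - p) / abs(t)
                       for t, p in zip(targets, preds)) / len(targets)
```

Because the error is relative, MAPE is comparable across speakers with very different absolute pitch ranges, which is why it suits cross-speaker and cross-language evaluation here.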


2013 ◽  
Vol 427-429 ◽  
pp. 1879-1882
Author(s):  
Chun Xiang Zhang ◽  
Xue Yao Gao ◽  
Zhi Mao Lu

Sense disambiguation is an important problem in pattern recognition. In this paper, a new sense disambiguation algorithm is proposed, in which the part-of-speech tags of the words immediately to the left and right of the ambiguous word are extracted as discriminative features. A Bayesian model built on these discriminative features is selected as the sense disambiguation classifier, and the architecture of the sense classifier is given. The new algorithm is trained on a sense-annotated corpus and then used to determine the sense category of the ambiguous word. Experimental results show that the disambiguation accuracy reaches 60%.
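A classifier of this shape, naive Bayes over the left and right part-of-speech tags with add-one smoothing, can be sketched as follows. The smoothing choice and all names are assumptions for illustration; the paper's exact model may differ.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Train on ((left_pos, right_pos), sense) pairs: count sense priors
    and per-sense feature-value frequencies."""
    priors = Counter(sense for _, sense in samples)
    likelihoods = defaultdict(Counter)   # sense -> Counter over feature values
    for (left, right), sense in samples:
        likelihoods[sense][("L", left)] += 1
        likelihoods[sense][("R", right)] += 1
    return priors, likelihoods, len(samples)

def classify(priors, likelihoods, n, left, right):
    """Pick the sense maximizing log P(sense) + sum of log P(feature | sense),
    with add-one smoothing over the observed feature vocabulary."""
    vocab = {f for c in likelihoods.values() for f in c}
    best, best_lp = None, -math.inf
    for sense, count in priors.items():
        lp = math.log(count / n)
        for feat in [("L", left), ("R", right)]:
            lp += math.log((likelihoods[sense][feat] + 1) /
                           (sum(likelihoods[sense].values()) + len(vocab)))
        if lp > best_lp:
            best, best_lp = sense, lp
    return best
```

With only two context features the model is deliberately coarse, which is consistent with an accuracy around 60%; richer context windows would add features to the same scheme.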

