Fish recognition in underwater environments using deep learning and audio data

Author(s): Jean-François Laplante, Moulay A. Akhloufi, Cédric Gervaise


2021 · Vol 263 (2) · pp. 4441-4445
Author(s): Hyunsuk Huh, Seungchul Lee

Audio data acquired at industrial manufacturing sites often include unexpected background noise, which can degrade the performance of data-driven models, so removing it is important. Traditionally, there are two main noise-canceling techniques. One is Active Noise Canceling (ANC), which generates an inverted-phase copy of the sound to be removed; the other is Passive Noise Canceling (PNC), which physically blocks the noise. However, both require bulky and expensive equipment. We therefore propose a deep learning-based noise canceling method built on an audio imaging technique and a deep segmentation network. Notably, the proposed model only needs to know whether an audio clip contains noise or not; unlike general segmentation techniques, it does not require a pixel-wise ground-truth segmentation map. We evaluate the separation using pump sounds from the open-source MIMII dataset.
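
As a rough illustration of the weak-supervision idea, the sketch below (an assumed architecture, not the authors' exact model) predicts a time-frequency noise mask from a spectrogram image but is trained with only a clip-level "contains noise" label, obtained by pooling the mask down to a single score:

```python
# Minimal weak-supervision sketch: a small encoder-decoder predicts a
# per-pixel noise mask, but the loss only uses clip-level labels.
import torch
import torch.nn as nn

class WeakMaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, spec):                       # spec: (B, 1, freq, time)
        mask = self.mask_head(self.encoder(spec))  # per-pixel noise mask
        clip_score = mask.amax(dim=(2, 3))         # pool to a clip-level score
        return mask, clip_score

model = WeakMaskNet()
criterion = nn.BCELoss()
spec = torch.rand(8, 1, 64, 128)               # stand-in log-mel spectrograms
labels = torch.randint(0, 2, (8, 1)).float()   # 1 = clip contains noise
mask, clip_score = model(spec)
loss = criterion(clip_score, labels)           # only clip-level labels needed
loss.backward()
```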


2021 · Vol 3
Author(s): Sushovan Chanda, Kedar Fitwe, Gauri Deshpande, Björn W. Schuller, Sachin Patel

Research on self-efficacy and confidence has spread across several subfields of psychology and neuroscience. A person's confidence is crucial in the formation of attitude and communication skills, and differentiating levels of confidence is clearly important in this domain. With recent advances in extracting behavioral insights from signals in multiple applications, confidence detection has gained great importance; one prominent application is detecting confidence in interview conversations. We collected an audiovisual data set of interview conversations with 34 candidates. Every response from each candidate in this data set is labeled with one of three levels of confidence: high, medium, or low. We also developed algorithms to efficiently compute such behavioral confidence from speech and video, and we propose a deep learning architecture for detecting the confidence level (high, medium, or low) of an audiovisual clip recorded during an interview. The achieved unweighted average recall (UAR) reaches 85.9% on audio data and 73.6% on video data captured from an interview session.
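
For reference, UAR is simply the unweighted (macro) average of per-class recall, which keeps the metric honest when the three confidence levels are imbalanced. A minimal check with scikit-learn on toy labels (not the paper's data):

```python
# UAR = macro-averaged recall: average the recall of each class equally,
# regardless of how many samples each class has.
from sklearn.metrics import recall_score

y_true = ["high", "high", "medium", "medium", "low", "low"]
y_pred = ["high", "medium", "medium", "medium", "low", "high"]

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}")  # mean of per-class recalls for high, medium, low
```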


Author(s): Dan Stowell

Terrestrial bioacoustics, like many other domains, has recently witnessed some transformative results from the application of deep learning and big data (Stowell 2017, Mac Aodha et al. 2018, Fairbrass et al. 2018, Mercado III and Sturdy 2017). Generalising over specific projects, which bioacoustic tasks can we consider "solved"? What can we expect in the near future, and what remains hard to do? What does a bioacoustician need to understand about deep learning? This contribution will address these questions, giving the audience a concise summary of recent developments and ways forward. It builds on recent projects and evaluation campaigns led by the author (Stowell et al. 2015, Stowell et al. 2018), as well as broader developments in signal processing, machine learning and bioacoustic applications of these. We will discuss which types of deep learning networks are appropriate for audio data, how to address zoological/ecological applications, which often have little available data, and issues in integrating deep learning predictions with existing workflows in statistical ecology.


Deep learning has been attracting increasing attention from researchers for transforming input data into effective representations through various learning algorithms. It therefore requires large and varied datasets to ensure good performance and generalization, but manually labeling a dataset is a time-consuming and expensive process, which limits dataset size. Websites such as YouTube and Freesound provide large volumes of audio data along with their metadata. General-purpose audio tagging is one of the newly proposed tasks in DCASE and can give valuable insights into the classification of various acoustic sound events. The proposed work analyzes large-scale imbalanced audio data for an audio tagging system. The baseline of the proposed audio tagging system is a Convolutional Neural Network with Mel-Frequency Cepstral Coefficients. The audio tagging system was developed on Google Colaboratory with a free Tesla K80 GPU using Keras, TensorFlow, and PyTorch. The experimental results show that the proposed audio tagging system achieves a mean average precision of 0.92.
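
A minimal Keras sketch of an MFCC-plus-CNN baseline of the kind described; the layer sizes, 40 MFCC coefficients, and 41 tags (the DCASE 2018 Task 2 class count) are illustrative assumptions rather than the paper's exact configuration:

```python
# Illustrative MFCC + CNN tagging baseline (assumed hyperparameters).
import librosa
import numpy as np
from tensorflow import keras

def extract_mfcc(path, n_mfcc=40, max_frames=128):
    """Load a clip and return a fixed-size (n_mfcc, max_frames) MFCC matrix."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc = mfcc[:, :max_frames]                       # crop long clips
    pad = max_frames - mfcc.shape[1]
    return np.pad(mfcc, ((0, 0), (0, pad)))           # zero-pad short clips

n_tags = 41  # assumption: DCASE 2018 Task 2 has 41 sound-event categories
model = keras.Sequential([
    keras.layers.Input(shape=(40, 128, 1)),  # add mfcc[..., None] channel axis
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(n_tags, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```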


Author(s): Marcel Nikmon, Roman Budjač, Daniel Kuchár, Peter Schreiber, Dagmar Janáčová

Deep learning is a kind of machine learning, which in turn is a kind of artificial intelligence: machine learning covers a group of various technologies, and deep learning is one of them. The use of deep learning is an integral part of current data classification practice. This paper introduces the possibilities of classification using convolutional networks. Experiments focused on audio and video data show different approaches to data classification. Most experiments use the well-known pre-trained AlexNet network with various types of input-data pre-processing, but other neural network architectures are also compared, and we show the results of training on both small and larger datasets. The paper describes eight different kinds of experiments. Several training sessions were conducted in each experiment, with different aspects monitored. The focus was put on the effect of batch size on the accuracy of deep learning, alongside many other parameters that affect deep learning [1].
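
For concreteness, a hedged sketch of the transfer-learning setup such experiments typically use: torchvision's pre-trained AlexNet with its final layer replaced, where the batch size is the hyperparameter whose effect on accuracy is monitored (the class count and batch size here are illustrative, not the paper's settings):

```python
# Hedged sketch: pre-trained AlexNet with a replaced classifier head.
import torch
from torchvision import models

n_classes = 10  # assumption: depends on the dataset used in each experiment
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
model.classifier[6] = torch.nn.Linear(4096, n_classes)  # new output layer

# Training would then iterate over a torch.utils.data.DataLoader built with
# the batch size under study, e.g. DataLoader(dataset, batch_size=32).
```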


Sensors · 2022 · Vol 22 (2) · pp. 592
Author(s): Deokgyu Yun, Seung Ho Choi

This paper proposes an audio data augmentation method based on deep learning in order to improve dereverberation performance. Conventionally, audio data are augmented using a room impulse response that is generated artificially, for example by the image method. The proposed method instead estimates a reverberation environment model with a deep neural network trained using clean and recorded audio data as inputs and outputs, respectively. A large, realistic augmented database is then constructed with the trained reverberation model, and the dereverberation model is trained on this augmented database. The performance of the augmentation model was verified by the log spectral distance and mean square error between the real augmented data and the recorded data. In addition, dereverberation experiments showed that the proposed method improves performance compared with the conventional method.
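
For reference, one common definition of the log spectral distance used to verify augmentation models, sketched with NumPy and librosa; the STFT parameters are assumptions, and the paper's exact formulation may differ:

```python
# Log spectral distance: RMS difference of log power spectra per frame,
# averaged over frames (one standard formulation).
import numpy as np
import librosa

def log_spectral_distance(x, y, n_fft=512, hop=128, eps=1e-10):
    X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) ** 2
    Y = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    diff = 10 * np.log10(X + eps) - 10 * np.log10(Y + eps)
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=0)))  # average over frames

real = np.random.randn(16000)        # stand-ins for recorded / augmented audio
augmented = real + 0.01 * np.random.randn(16000)
print(log_spectral_distance(real, augmented))
```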


2020 · Vol 77 · pp. 02004
Author(s): Aozora Kobayashi, Ian Wilson

The main purpose of this research is to test the use of deep learning for automatically classifying an English learner’s pronunciation proficiency, a step toward the construction of a system that supports second-language learners. Our dataset consists of 28 speakers, ranging in proficiency from native to beginner non-native, reading the same 216-word English story. In the supervised training process, we first label the English proficiency level of the data, but this is a complicated task because there are a number of different ways to determine someone’s speech proficiency. In this research, we focus on three elements: foreign accent, speech fluency (as measured by total number of pauses, total length of pauses, and speed of speech), and pronunciation (as measured by speech intelligibility). We use Long Short-Term Memory (LSTM) layers for deep learning, train on differently labeled data, test on separate data, and present the results. The features used from the audio data are Mel-Frequency Cepstral Coefficients (MFCCs) and pitch. We try several combinations of deep learning parameters to find out which settings are best for our database, and we also vary the labeling method, the length of each audio sample, and the cross-validation method. We conclude that labeling by speech fluency rather than by speech intelligibility tends to yield better deep learning test accuracy.
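
A minimal Keras sketch of an LSTM classifier over MFCC sequences of the kind described; the sequence length, feature dimensionality, and layer sizes are illustrative assumptions, not the settings tuned in the paper:

```python
# Illustrative LSTM classifier: MFCC frame sequence -> proficiency label.
from tensorflow import keras

n_frames, n_mfcc, n_levels = 200, 13, 3  # time steps, features, label classes
model = keras.Sequential([
    keras.layers.Input(shape=(n_frames, n_mfcc)),
    keras.layers.LSTM(64, return_sequences=True),
    keras.layers.LSTM(32),
    keras.layers.Dense(n_levels, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```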


2021 · Vol 11
Author(s): David Dalmazzo, George Waddell, Rafael Ramírez

Repetitive practice is one of the most important factors in improving the performance of motor skills. This paper focuses on the analysis and classification of forearm gestures in the context of violin playing. We recorded five experts and three students performing eight traditional classical violin bow-strokes: martelé, staccato, détaché, ricochet, legato, tremolo, collé, and col legno. To record inertial motion information, we used the Myo sensor, which reports a multidimensional time-series signal, and we synchronized the inertial motion recordings with audio data to extract the spatiotemporal dynamics of each gesture. Applying state-of-the-art deep neural networks, we implemented and compared different architectures: convolutional neural network (CNN) models demonstrated recognition rates of 97.147%, 3DMultiHeaded_CNN models reached 98.553%, and CNN_LSTM models achieved 99.234%. The collected data (the quaternion of the bowing arm of a violinist) contained sufficient information to distinguish the bowing techniques studied, and the deep learning methods were capable of learning the movement patterns that distinguish these techniques. Each of the learning algorithms investigated (CNN, 3DMultiHeaded_CNN, and CNN_LSTM) produced high classification accuracies, supporting the feasibility of training such classifiers. The resulting classifiers may provide the foundation of a digital assistant that enhances musicians' time spent practicing alone by providing real-time feedback on the accuracy and consistency of their musical gestures in performance.
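
As a hedged illustration of the CNN_LSTM family of models reported above, the sketch below classifies 4-channel quaternion time series into the eight bow-stroke classes; the window length and layer sizes are assumptions, not the paper's reported configuration:

```python
# Illustrative CNN + LSTM hybrid: 1-D convolution extracts local motion
# features, the LSTM models their temporal evolution across the gesture.
from tensorflow import keras

timesteps, channels, n_strokes = 100, 4, 8  # quaternion stream, 8 bow-strokes
model = keras.Sequential([
    keras.layers.Input(shape=(timesteps, channels)),
    keras.layers.Conv1D(32, 5, activation="relu", padding="same"),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(64),
    keras.layers.Dense(n_strokes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```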


Author(s): Ghada Alqubati, Ghaleb Algaphari

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder. It can have a massive impact on a patient's memory and mobility. As the disease is irreversible, early diagnosis is crucial for delaying the symptoms and adjusting the patient's lifestyle. Many machine learning (ML) and deep learning (DL) based approaches have been proposed to accurately predict AD before the onset of its symptoms. However, finding the most effective approach for early AD prediction is still challenging. This review explored 24 papers published from 2018 to 2021. These papers proposed different approaches, using state-of-the-art machine learning and deep learning algorithms on different biomarkers, for early AD detection. The review examined them from different perspectives to derive potential research gaps and draw conclusions and recommendations. It classified these recent approaches in terms of the learning technique used and the AD biomarkers; summarized and compared their findings; defined their strengths and limitations; and provided a summary of the common AD biomarkers. From this review, it was found that some approaches strove to increase prediction accuracy regardless of their complexity, such as by using heterogeneous datasets, while others sought the most practical and affordable ways to predict the disease while still achieving good accuracy, such as by using audio data. It was also noticed that DL-based approaches with image biomarkers remarkably surpassed ML-based approaches, but performed poorly with genetic-variant data. Despite the great importance of genetic-variant biomarkers, their large variance and complexity can lead to complex approaches or poor accuracy, yet these data are crucial for discovering the underlying structure of AD and detecting it at early stages. An effective pre-processing approach is still needed to refine these data so that they can be employed efficiently by powerful DL algorithms.

