A Modularized Architecture of Multi-Branch Convolutional Neural Network for Image Captioning

Electronics ◽  
2019 ◽  
Vol 8 (12) ◽  
pp. 1417 ◽  
Author(s):  
Shan He ◽  
Yuanyao Lu

Image captioning is a comprehensive task spanning computer vision (CV) and natural language processing (NLP): the algorithm automatically generates descriptive text for an input image. In this paper, we present an end-to-end model that uses a deep convolutional neural network (CNN) as the encoder and a recurrent neural network (RNN) as the decoder. To extract better features for captioning, we propose a highly modularized multi-branch CNN that increases accuracy while keeping the number of hyper-parameters unchanged. This strategy yields a simply designed network consisting of parallel sub-modules with the same structure. While traditional CNNs go deeper and wider to increase accuracy, our method achieves this with a simpler design that is easier to optimize for practical applications. Experiments are conducted on the Flickr8k, Flickr30k, and MSCOCO datasets. Results demonstrate that our method achieves state-of-the-art performance in terms of caption quality.
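The split-transform-merge idea described above can be sketched with a toy numpy stand-in (not the authors' actual architecture): several identically shaped branches process the same input in parallel and their summed output is merged with a residual skip, so adding branches raises capacity without introducing new kinds of hyper-parameters.

```python
import numpy as np

def branch(x, w):
    """One sub-module: a single linear transform + ReLU standing in for a small conv stack."""
    return np.maximum(x @ w, 0.0)

def multi_branch_block(x, weights):
    """Sum the outputs of identical parallel branches, then merge with the skip path."""
    return sum(branch(x, w) for w in weights) + x

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))                         # toy batch of feature vectors
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]  # 4 identical branches
y = multi_branch_block(x, weights)
print(y.shape)  # (2, 8)
```

Every branch shares the same topology, so the only structural hyper-parameter added is the branch count itself.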

2021 ◽  
Author(s):  
Weihao Zhuang ◽  
Tristan Hascoet ◽  
Xunquan Chen ◽  
Ryoichi Takashima ◽  
Tetsuya Takiguchi ◽  
...  

Currently, deep learning plays an indispensable role in many fields, including computer vision, natural language processing, and speech recognition. Convolutional neural networks (CNNs) have demonstrated excellent performance in computer vision tasks thanks to their powerful feature-extraction capability. However, as larger models have shown higher accuracy, recent developments have led to state-of-the-art CNN models with ever-increasing resource consumption. This paper investigates a conceptual approach to reducing the memory consumption of CNN inference. Our method processes the input image as a sequence of carefully designed tiles within the lower subnetwork of the CNN, so as to minimize its peak memory consumption while keeping the end-to-end computation unchanged. This introduces a trade-off between memory consumption and computation that is particularly suitable for high-resolution inputs. Our experimental results show that MobileNetV2 memory consumption can be reduced by up to 5.3 times with the proposed method. For ResNet50, one of the most commonly used CNN models in computer vision tasks, memory can be reduced by up to 2.3 times.
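The equivalence the abstract relies on, that tiling with a halo equal to the receptive field leaves the end-to-end computation unchanged, can be checked with a minimal numpy sketch (a single 3x3 "valid" convolution standing in for the lower subnetwork; tile size and shapes are illustrative):

```python
import numpy as np

def conv3x3_valid(img, k):
    """Naive 3x3 'valid' convolution (stand-in for the lower subnetwork)."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * k)
    return out

def tiled_conv3x3(img, k, tile=8):
    """Process the image tile by tile with a 1-pixel halo; peak memory per call
    is one (tile+2)^2 patch instead of the whole feature map."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(0, H - 2, tile):
        for j in range(0, W - 2, tile):
            patch = img[i:min(i + tile, H - 2) + 2, j:min(j + tile, W - 2) + 2]
            out[i:i + tile, j:j + tile] = conv3x3_valid(patch, k)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((32, 32))
k = rng.standard_normal((3, 3))
full = conv3x3_valid(img, k)
tiled = tiled_conv3x3(img, k)
print(np.allclose(full, tiled))  # True
```

The halo width must match the receptive field of the subnetwork being tiled; the trade-off in the paper comes from recomputing those overlapping halo pixels.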


Sensors ◽  
2019 ◽  
Vol 19 (8) ◽  
pp. 1795 ◽  
Author(s):  
Xiao Lin ◽  
Dalila Sánchez-Escobedo ◽  
Josep R. Casas ◽  
Montse Pardàs

Semantic segmentation and depth estimation are two important tasks in computer vision, and many methods have been developed to tackle them. Commonly these two tasks are addressed independently, but recently the idea of merging them into a single framework has been studied under the assumption that integrating two highly correlated tasks may allow each to improve the other's estimation accuracy. In this paper, depth estimation and semantic segmentation are jointly addressed from a single RGB input image by a unified convolutional neural network. We analyze two different architectures to evaluate which features are more relevant when shared by the two tasks and which should be kept separate to achieve mutual improvement. Likewise, our approaches are evaluated under two different scenarios designed to compare our results with single-task and multi-task methods. Qualitative and quantitative experiments demonstrate that our methodology outperforms state-of-the-art single-task approaches, while obtaining competitive results compared with other multi-task methods.
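A minimal sketch of the shared-encoder/two-head layout under discussion (numpy stand-ins for the convolutional layers; all names and shapes are hypothetical, not the paper's architecture):

```python
import numpy as np

def shared_encoder(x, w_enc):
    """Features shared by both tasks (the layers whose sharing the paper studies)."""
    return np.maximum(x @ w_enc, 0.0)

def seg_head(f, w_seg):
    """Per-pixel class scores -> softmax distribution over semantic classes."""
    z = f @ w_seg
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def depth_head(f, w_depth):
    """Per-pixel strictly positive depth via softplus."""
    return np.log1p(np.exp(f @ w_depth))

rng = np.random.default_rng(2)
x = rng.standard_normal((16, 4))      # 16 "pixels", 4 input features each
w_enc = rng.standard_normal((4, 8))
w_seg = rng.standard_normal((8, 5))   # 5 semantic classes
w_depth = rng.standard_normal((8, 1))
f = shared_encoder(x, w_enc)          # computed once, consumed by both heads
seg, depth = seg_head(f, w_seg), depth_head(f, w_depth)
```

The architectural question in the paper is exactly where `shared_encoder` ends and the task-specific heads begin.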


2019 ◽  
Vol 9 (6) ◽  
pp. 1143 ◽  
Author(s):  
Sevinj Yolchuyeva ◽  
Géza Németh ◽  
Bálint Gyires-Tóth

Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciations for words from their written form. It plays an essential role in natural language processing, text-to-speech synthesis, and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNNs) for G2P conversion and propose a novel CNN-based sequence-to-sequence (seq2seq) architecture. Our approach includes an end-to-end CNN G2P model with residual connections and, furthermore, a model that uses a convolutional neural network (with and without residual connections) as the encoder and a Bi-LSTM as the decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times and phoneme and word error rates were evaluated on the public CMUDict dataset for US English, and the best-performing convolutional architecture was also evaluated on the NetTalk dataset. Our method approaches the accuracy of previous state-of-the-art results in terms of phoneme error rate.
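The phoneme error rate used in such evaluations is the Levenshtein distance between predicted and reference phoneme sequences, normalized by the reference length; a minimal implementation (the ARPAbet example below is illustrative, not taken from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (free if symbols match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "K AE T".split()   # "cat" in CMUDict-style ARPAbet
hyp = "K AE D".split()   # one substitution error
per = phoneme_error_rate(ref, hyp)
print(per)  # 0.333...
```

Word error rate is computed the same way at the word level: a word counts as wrong if any of its phonemes differ.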


Author(s):  
Sergiy Pogorilyy ◽  
Artem Kramov

The detection of coreferent pairs within a text is one of the basic tasks in natural language processing (NLP). State-of-the-art methods of coreference resolution are based on machine-learning algorithms, whose key idea is to detect regularities between the semantic or grammatical features of text entities. In this paper, a comparative analysis of current methods of coreference resolution in English and Ukrainian texts has been performed. The key disadvantage of many methods is that they interpret coreference resolution as a classification problem, whereas the result of coreferent-pair detection is a set of groups whose elements refer to a common entity; it is therefore advisable to treat coreference resolution as a clustering task. A method of coreference resolution using a set of filtering sieves and a convolutional neural network has been suggested. The set of filtering sieves for finding candidate coreferent pairs has been implemented, and a multichannel convolutional neural network has been trained on an annotated Ukrainian corpus. The multichannel structure allows analysis of different components of text units: semantic, lexical, and grammatical features of words and sentences. Furthermore, the convolutional layer makes it possible to process input of variable size (words or sentences of a text). The output of the method is a set of clusters. In order to form clusters, it is necessary to take into account the previous steps of the model's workflow; since such an approach contradicts the traditional methodology of machine learning, the network has been trained with the SEARN algorithm, which allows solving tasks with unfixed output structures using a classifier model. An experimental evaluation of the method on a corpus of Ukrainian news has been performed. To estimate the accuracy of the method, common metrics for clustering tasks have been calculated. The results indicate that the suggested method can be used to find coreferent pairs within Ukrainian texts, and the method can easily be adapted to other natural languages.
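The first and strictest kind of filtering sieve, exact string match, can be sketched as follows (a toy stand-in for the authors' pipeline; later sieves would merge these clusters on looser evidence such as head match or pronoun agreement, and the example mentions are invented):

```python
def exact_match_sieve(mentions):
    """Group mention indices whose normalized surface text matches exactly.

    This is the highest-precision sieve; it runs first so that later,
    lower-precision sieves operate on already-merged clusters.
    """
    clusters = {}
    for idx, text in enumerate(mentions):
        clusters.setdefault(text.lower(), []).append(idx)
    return list(clusters.values())

mentions = ["Kyiv", "the city", "KYIV", "the city", "it"]
print(exact_match_sieve(mentions))  # [[0, 2], [1, 3], [4]]
```

Note that the output is already a clustering, not a set of pairwise classification decisions, which matches the paper's framing of the task.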


Symmetry ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 98
Author(s):  
Vladimir Sergeevich Bochkov ◽  
Liliya Yurievna Kataeva

This article describes an AI-based solution to multiclass fire segmentation. The flame contours are divided into red, yellow, and orange areas; this separation is necessary to identify the hottest regions for flame suppression. Flame objects can have a wide variety of shapes (convex and non-convex), so segmentation is more applicable than object detection: the center of the fire is much more accurate and reliable information than the center of a bounding box and can therefore be used by robotic systems for aiming. The UNet model is used as the baseline for the initial solution because it is a well-established open-source convolutional neural network for segmentation. Since no open dataset for multiclass fire segmentation is available, a custom dataset of 6250 samples from 36 videos was developed and used in this study. We compared the trained UNet models under several input configurations. First, we compare fitting the whole frame to one window against sliding a window of non-intersected areas over the input image. Second, we choose the best loss-function metric (soft Dice versus Jaccard). We address the problem of detecting flame regions at the boundaries of non-intersected areas, and introduce new combinational methods of obtaining the output signal, based on weighted summation and Gaussian mixtures of half-intersected areas, as a solution. Finally, we present the UUNet-concatenative and wUUNet models, which demonstrate significant improvements in accuracy and are considered state-of-the-art. All models use the original UNet backbone at the encoder layers (i.e., VGG16) to demonstrate the superiority of the proposed architectures. The results can be applied to many robotic firefighting systems.
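The weighted-summation combination of half-intersected windows can be illustrated in one dimension (a simplified stand-in for the paper's method; window size, Gaussian width, and the constant test signal are arbitrary): each window's prediction is weighted by a Gaussian peaked at its center, so pixels near a window boundary defer to the neighboring window whose center is closer.

```python
import numpy as np

def blend_windows(preds, starts, length, win=8):
    """Weighted summation of half-overlapping window outputs.

    The Gaussian weight peaks at each window's center, down-weighting the
    unreliable border pixels; overlapping contributions are renormalized.
    """
    center = (win - 1) / 2.0
    w = np.exp(-((np.arange(win) - center) ** 2) / (2 * (win / 4.0) ** 2))
    out = np.zeros(length)
    norm = np.zeros(length)
    for p, s in zip(preds, starts):
        out[s:s + win] += w * p
        norm[s:s + win] += w
    return out / norm

# Three half-overlapping windows predicting a constant signal of 2.0:
preds = [np.full(8, 2.0), np.full(8, 2.0), np.full(8, 2.0)]
blended = blend_windows(preds, starts=[0, 4, 8], length=16)
print(np.allclose(blended, 2.0))  # True
```

The constant-signal case shows the blend is unbiased; in the 2-D segmentation setting the same scheme smooths seams between adjacent windows.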


Author(s):  
Rishipal Singh ◽  
Rajneesh Rani ◽  
Aman Kamboj

Fruit classification is one of the influential applications of computer vision. Traditional classification models are trained on features such as color, shape, and texture, but these features are common across different varieties of the same fruit. Therefore, a new set of features is required to classify fruits belonging to the same class. In this paper, we propose an optimized method for intra-class fruit classification using deep convolutional layers. The proposed architecture is capable of solving the challenges of a commercial tray-based system in the supermarket. As research in intra-class classification is still in its infancy, there are challenges that have not yet been tackled, and the proposed method is specifically designed to overcome those related to intra-class fruit classification. It shows impressive performance for intra-class classification, achieved with fewer parameters than existing methods. The proposed model consists of an Inception block, residual connections, and various other layers in a precise order. To validate its performance, the proposed method is compared with state-of-the-art models and performs best in terms of accuracy, loss, parameters, and depth.
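The combination of an Inception-style block with a residual connection can be sketched abstractly (linear numpy stand-ins for the parallel 1x1/3x3/5x5 convolutions; the widths are arbitrary and not the authors' configuration):

```python
import numpy as np

def inception_residual_block(x, branch_ws, proj_w):
    """Parallel branches of different widths are concatenated, projected back
    to the input width, and added to the skip connection."""
    branches = [np.maximum(x @ w, 0.0) for w in branch_ws]  # stand-ins for convs
    merged = np.concatenate(branches, axis=-1)
    return x + merged @ proj_w                              # residual connection

rng = np.random.default_rng(3)
x = rng.standard_normal((4, 16))                            # toy batch
branch_ws = [rng.standard_normal((16, 8)) * 0.1 for _ in range(3)]
proj_w = rng.standard_normal((24, 16)) * 0.1                # 3*8 -> 16 projection
y = inception_residual_block(x, branch_ws, proj_w)
print(y.shape)  # (4, 16)
```

The projection back to the input width is what makes the residual addition shape-compatible while keeping the parameter count small.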


2020 ◽  
Vol 6 (11) ◽  
pp. 127
Author(s):  
Ibrahem Kandel ◽  
Mauro Castelli ◽  
Aleš Popovič

The classification of musculoskeletal images can be very challenging, especially when it must be done in the emergency room, where a decision has to be made rapidly. The computer vision domain has gained increasing attention in recent years due to its achievements in image classification. The convolutional neural network (CNN) is one of the latest computer vision algorithms to achieve state-of-the-art results. A CNN requires an enormous number of images to be adequately trained, and these are always scarce in the medical field. Transfer learning is a technique used to train a CNN with fewer images. In this paper, we study the appropriate way to classify musculoskeletal images, comparing transfer learning with training from scratch. We applied six state-of-the-art architectures and compared their performance under both regimes. Our results show that transfer learning increased model performance significantly and, additionally, made the model less prone to overfitting.
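The core mechanics of transfer learning, a frozen pretrained backbone with only the classification head trained, can be sketched on synthetic data (a numpy logistic-regression head; the data, shapes, and random "pretrained" weights are toy stand-ins, not the musculoskeletal images or architectures from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

# Frozen "pretrained" backbone: its weights are never updated.
w_backbone = rng.standard_normal((10, 6))
def features(x):
    return np.maximum(x @ w_backbone, 0.0)

# Tiny synthetic binary task standing in for a small medical dataset.
x = rng.standard_normal((64, 10))
y = (x[:, 0] > 0).astype(float)

w_head = np.zeros(6)  # only these 6 parameters are trained
def loss(w):
    p = 1.0 / (1.0 + np.exp(-(features(x) @ w)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

loss_before = loss(w_head)
for _ in range(300):  # gradient descent on the head only
    p = 1.0 / (1.0 + np.exp(-(features(x) @ w_head)))
    w_head -= 0.05 * features(x).T @ (p - y) / len(y)
loss_after = loss(w_head)
print(loss_after < loss_before)  # True
```

With far fewer trainable parameters than a full network, the head needs far fewer images to fit, which is why the technique helps when medical data is scarce.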


2020 ◽  
pp. 004051752095522
Author(s):  
Feng Li

In this paper, a bag of tricks is proposed to improve the precision of fabric defect detection. Although general state-of-the-art convolutional neural network detection algorithms achieve a good detection effect, detection precision still has considerable room for improvement on fabric defects. We therefore propose three tricks. First, we use multiscale training, which scales the single input image into a number of images of different resolutions for training, so as to adapt to box distributions at different scales. Second, we use the dimension-clusters method: observing the distribution of defect widths and heights in the fabric dataset, we find that the defect sizes are extremely unbalanced and span a large range. Because the default prior-box settings may therefore be suboptimal, we cluster the widths and heights of the defects in the dataset, making the network model easier to learn. Third, we use soft non-maximum suppression instead of traditional non-maximum suppression, to avoid overlapping defects of the same category being eliminated as repeated detections. With this bag of tricks, we improve the precision of fabric defect detection by 8.9% mAP over the baseline state-of-the-art convolutional neural network detection algorithm.
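The third trick, Gaussian soft non-maximum suppression, decays the scores of overlapping boxes instead of deleting them; a minimal sketch (box coordinates, scores, and sigma are illustrative, and no removal threshold is applied so the contrast with hard NMS is visible):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5):
    """Gaussian soft-NMS: each selected box decays the scores of its
    overlapping neighbors by exp(-iou^2 / sigma) instead of discarding them."""
    boxes, scores = list(boxes), list(scores)
    kept = []
    while boxes:
        i = max(range(len(scores)), key=scores.__getitem__)
        kept.append((boxes[i], scores[i]))
        best = boxes.pop(i)
        scores.pop(i)
        scores = [s * np.exp(-(iou(best, b) ** 2) / sigma)
                  for b, s in zip(boxes, scores)]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
# The heavily overlapping second box survives with a decayed score
# instead of being eliminated as a repeated detection.
```

Hard NMS would drop the second box entirely; soft-NMS keeps it with a lower score, which is exactly the behavior wanted for genuinely overlapping defects of the same category.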


2020 ◽  
Vol 3 (1) ◽  
pp. 138-146
Author(s):  
Subash Pandey ◽  
Rabin Kumar Dhamala ◽  
Bikram Karki ◽  
Saroj Dahal ◽  
Rama Bastola

Automatically generating a natural language description of an image is a major challenge in artificial intelligence, bringing together two fields: natural language processing and computer vision. There are two types of approaches, top-down and bottom-up. In this paper we adopt the top-down approach, which starts from the image and converts it into words. The image is passed to a convolutional neural network (CNN) encoder, and its output is fed to a recurrent neural network (RNN) decoder that generates meaningful captions. We generated image descriptions for real-time images from a smartphone camera as well as for test images from the dataset. To evaluate model performance, we used the BLEU (Bilingual Evaluation Understudy) score, matching predicted words to the original caption.
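Unigram BLEU, the simplest form of the metric mentioned above, multiplies a clipped word-precision by a brevity penalty; a minimal sketch (the example sentences are invented, and real evaluations also use higher-order n-grams):

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Unigram BLEU: clipped precision times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a dog runs on the beach", "a dog runs on sand")
print(round(score, 3))
```

The clipping step prevents a candidate from gaming precision by repeating a reference word, and the brevity penalty prevents gaming it with very short outputs.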


Author(s):  
Anish Banda

In the proposed model, we examine a deep neural network-based image caption generation technique. Given an input image, the model produces output in three forms: a sentence describing the image in three different languages, an mp3 audio file, and an image file. The model uses techniques from both computer vision and natural language processing. We aim to build a caption generator by combining a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The CNN compares the target image against a large dataset of training images, and the model generates a decent description from the trained data. We use the CNN as an encoder to extract features from images and the LSTM as a decoder to generate the description. To evaluate the accuracy of the generated captions we use the BLEU metric, which grades the quality of the generated content; performance is calculated with standard evaluation metrics.

Keywords: CNN, RNN, LSTM, BLEU score, encoder, decoder, captions, image description.
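Decoding a caption from an LSTM is typically done greedily, taking the argmax word at each step and feeding it back until an end token appears; a sketch with a toy deterministic "decoder step" (the transition table is a stand-in for real LSTM logits):

```python
import numpy as np

def greedy_decode(step, start_id, end_id, max_len=10):
    """Greedy caption decoding: feed the previous word back in and take the
    argmax at every step until the end token (or a length cap) is reached."""
    words, prev = [], start_id
    for _ in range(max_len):
        probs = step(prev)          # one decoder step -> vocab distribution
        prev = int(np.argmax(probs))
        if prev == end_id:
            break
        words.append(prev)
    return words

# Toy decoder over a 5-word vocabulary that deterministically walks 1 -> 2 -> 3 -> END.
table = {0: [0, 1, 0, 0, 0], 1: [0, 0, 1, 0, 0],
         2: [0, 0, 0, 1, 0], 3: [0, 0, 0, 0, 1]}
caption = greedy_decode(lambda w: np.array(table[w]), start_id=0, end_id=4)
print(caption)  # [1, 2, 3]
```

Beam search generalizes this loop by keeping the k best partial captions at each step instead of a single argmax.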

