Aspect-Aware Response Generation for Multimodal Dialogue System

2021 ◽  
Vol 12 (2) ◽  
pp. 1-33
Author(s):  
Mauajama Firdaus ◽  
Nidhi Thakur ◽  
Asif Ekbal

Multimodality in dialogue systems has opened up new frontiers for the creation of robust conversational agents. Any multimodal system aims at bridging the gap between language and vision by leveraging diverse and often complementary information from image, audio, and video, as well as text. In every task-oriented dialogue system, different aspects of the product or service are crucial for satisfying the user's demands, and the user decides whether to select a product or service based on these aspects. The ability to generate responses with the specified aspects in a goal-oriented dialogue setup facilitates user satisfaction by fulfilling the user's goals. Therefore, in our current work, we propose the task of aspect-controlled response generation in a multimodal task-oriented dialogue system. We employ a multimodal hierarchical memory network for generating responses, utilizing information from both text and images. As no data was readily available for building such multimodal systems, we create a Multi-Domain Multi-Modal Dialogue (MDMMD++) dataset. The dataset comprises conversations containing both text and images, belonging to four different domains: hotels, restaurants, electronics, and furniture. Quantitative and qualitative analysis on the newly created MDMMD++ dataset shows that the proposed methodology outperforms the baseline models on the proposed task of aspect-controlled response generation.
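
As a rough illustration of the hierarchical multimodal encoding described above, the sketch below encodes each turn's text with a GRU, projects precomputed image features into the same space, and runs a dialogue-level GRU over the fused turn representations. All module names and dimensions are assumptions for illustration; the authors' memory network and aspect conditioning are not reproduced here.

```python
# A minimal sketch of a multimodal hierarchical encoder, not the
# authors' exact architecture: utterance- and image-level features
# per turn feed a dialogue-level GRU whose states can serve as the
# memory a decoder attends over.
import torch
import torch.nn as nn

class MultimodalHierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid_dim)   # project CNN features
        self.ctx_rnn = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)

    def forward(self, turns, images):
        # turns:  (batch, n_turns, seq_len) token ids
        # images: (batch, n_turns, img_dim) precomputed image features
        b, t, s = turns.shape
        _, utt_h = self.utt_rnn(self.embed(turns.view(b * t, s)))
        utt_h = utt_h[-1].view(b, t, -1)              # (b, t, hid) per turn
        img_h = torch.relu(self.img_proj(images))     # (b, t, hid)
        ctx_in = torch.cat([utt_h, img_h], dim=-1)    # fuse text + image
        ctx_out, _ = self.ctx_rnn(ctx_in)             # dialogue-level states
        return ctx_out                                # memory for the decoder
```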

PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241271
Author(s):  
Mauajama Firdaus ◽  
Arunav Pratap Shandeelya ◽  
Asif Ekbal

Multimodal dialogue systems, owing to their manifold applications, have gained much attention from researchers and developers in recent times. With the release of the large-scale multimodal dialogue dataset of Saha et al. (2018) in the fashion domain, it has become possible to investigate dialogue systems having both textual and visual modalities. Response generation is an essential aspect of every dialogue system, and making the responses diverse is an important problem. For any goal-oriented conversational agent, the system's responses must be informative, diverse, and polite, as these qualities lead to better user experiences. In this paper, we propose an end-to-end neural framework for generating varied responses in a multimodal dialogue setup, capturing information from both the text and the image. A multimodal encoder with co-attention between text and image is used to focus on the different modalities and obtain better contextual information. For effective information sharing across the modalities, we combine the information of text and images using the BLOCK fusion technique, which helps in learning an improved multimodal representation. We employ stochastic beam search with the Gumbel top-k trick to achieve diversified responses while preserving the content and politeness of the responses. Experimental results show that our proposed approach performs significantly better than the existing and baseline methods in terms of distinct metrics, thereby generating more diverse responses that are informative, interesting, and polite without any loss of information. Empirical evaluation also reveals that images, when used along with the text, improve the model's ability to generate diversified responses.
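
Stochastic beam search rests on the Gumbel top-k trick: perturbing log-probabilities with i.i.d. Gumbel noise and keeping the k largest yields a sample of k items without replacement. A minimal sketch (the toy vocabulary size and shapes are assumptions):

```python
# Gumbel top-k: add Gumbel(0, 1) noise to log-probabilities and take
# the top-k indices to sample k candidates without replacement.
import torch

def gumbel_top_k(log_probs: torch.Tensor, k: int):
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1).
    u = torch.rand_like(log_probs).clamp_min(1e-9)
    perturbed = log_probs - torch.log(-torch.log(u))
    # The k largest perturbed scores form a sample without replacement.
    return perturbed.topk(k).indices

log_probs = torch.log_softmax(torch.randn(10_000), dim=-1)  # toy vocab
beams = gumbel_top_k(log_probs, k=5)  # 5 distinct candidate expansions
```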


Author(s):  
Markus Löckelt

This chapter describes a selection of experiences from designing and implementing virtual conversational characters for multimodal dialogue systems. It uses examples from the large interactive narrative VirtualHuman and some related systems of the task-oriented variety. The aim is not to give a comprehensive overview of any one system, but rather to identify and describe issues that might also be relevant to the designer of a new system, to show how they can be addressed, and to note which problems remain unresolved for future work. Besides giving an overview of how characters for interactive narrative systems can be built at the implementation level, the focus is on what the knowledge base for virtual characters should contain, and how it should be organized to provide a convincing interaction with one or multiple characters.


Author(s):  
Beatriz López Mencía ◽  
David D. Pardo ◽  
Alvaro Hernández Trapote ◽  
Luis A. Hernández Gómez

One of the major challenges for dialogue systems deployed in commercial applications is to improve robustness when common low-level problems related to speech recognition occur. We first discuss this important family of interaction problems, and then the features of non-verbal, visual communication that Embodied Conversational Agents (ECAs) bring 'into the picture', which may be tapped to improve spoken dialogue robustness and the general smoothness and efficiency of the interaction between the human and the machine. Our approach is centred around the information provided by ECAs. We deal with all stages of the conversational system development process, from scenario description to gesture design and evaluation with comparative user tests. We conclude that ECAs can help improve the robustness of, as well as the users' subjective experience with, a dialogue system. However, they may also make users more demanding and intensify privacy and security concerns.


2020 ◽  
Vol 12 (4) ◽  
pp. 1016-1046
Author(s):  
Hanif Fakhrurroja ◽  
Carmadi Machbub ◽  
Ary Setijadi Prihatmanto ◽  
Ayu Purwarianti ◽  
...  

Studies on human-machine interaction systems show positive results regarding system development accuracy. However, problems remain, especially when using certain input modalities such as speech, gesture, face detection, and skeleton tracking. These problems include how to design an interface through which a machine can contextualize ongoing conversations, how to activate the system via various modalities, how to choose the right multimodal fusion method, how the machine can understand human intentions, and how to develop its knowledge. This study developed a human-machine interaction method involving several stages: a multimodal activation system; methods for recognizing speech, gestures, faces, and skeleton tracking; multimodal fusion strategies; understanding human intent in an Indonesian dialogue system; and methods for developing machine knowledge and producing the right response. The research contributes an easier and more natural human-machine interaction system based on multimodal fusion. The average accuracy rates of multimodal activation, the Indonesian dialogue system test, gesture recognition interaction, and multimodal fusion are 87.42%, 92.11%, 93.54%, and 93%, respectively. User satisfaction with the developed multimodal recognition-based human-machine interaction system was 95%. According to 76.2% of users, the interaction system was natural, while 79.4% agreed that the machine responded well to their wishes.
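
As an illustration of decision-level multimodal fusion (a sketch of one common strategy, not necessarily the paper's exact method), each recognizer reports a confidence per intent and a weighted sum selects the response:

```python
# Decision-level fusion sketch: combine per-modality intent confidences
# with fixed weights and pick the highest-scoring intent. The intents,
# scores, and weights below are illustrative assumptions.
from collections import defaultdict

def fuse(modality_scores, weights):
    # modality_scores: {"speech": {"greet": 0.9, ...}, "gesture": {...}}
    fused = defaultdict(float)
    for modality, scores in modality_scores.items():
        for intent, p in scores.items():
            fused[intent] += weights[modality] * p
    return max(fused, key=fused.get)

scores = {"speech":  {"greet": 0.80, "stop": 0.20},
          "gesture": {"greet": 0.60, "stop": 0.40}}
print(fuse(scores, weights={"speech": 0.7, "gesture": 0.3}))  # -> "greet"
```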


Research ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-8
Author(s):  
Yangyang Zhou ◽  
Fuji Ren

The dialogue system has always been one of the important topics in the domain of artificial intelligence. So far, most mature dialogue systems are task-oriented, while non-task-oriented dialogue systems still have much room for improvement. We propose a data-driven non-task-oriented dialogue generator, "CERG", based on neural networks. The model has emotion recognition capability and can generate corresponding responses. The dataset we adopt comes from the NTCIR-14 STC-3 CECG subtask, which contains more than 1.7 million Chinese Weibo post-response pairs and six emotion categories. We concatenate the post and the response with the emotion, then mask the response part of the input text character by character to emulate the encoder-decoder framework. We use improved transformer blocks as the core of the model and add regularization methods to alleviate the problems of overcorrection and exposure bias. We introduce a retrieval method into the inference process to improve the semantic relevance of generated responses. The results of the manual evaluation show that our proposed model can produce different responses for different emotions, improving the human-computer interaction experience. The model can be applied in many domains, such as automatic reply bots for social applications.
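
A minimal sketch of the input construction described above, under the assumption that "masking character by character" means progressively revealing the response while masking the remainder; the [MASK]/[SEP] tokens and the toy example are illustrative, not the paper's exact scheme:

```python
# Build (input, target) pairs: post + emotion form the prefix, and the
# response is revealed one character at a time with the rest masked.
MASK, SEP = "[MASK]", "[SEP]"

def build_training_inputs(post, emotion, response):
    prefix = list(post) + [SEP, emotion, SEP]
    pairs = []
    for i in range(len(response)):
        seen = list(response[:i])              # already-generated characters
        masked = [MASK] * (len(response) - i)  # characters still to predict
        pairs.append((prefix + seen + masked, response[i]))
    return pairs  # (input sequence, next target character) pairs

for seq, target in build_training_inputs("好久不见", "happiness", "想你了"):
    print(seq, "->", target)
```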


Author(s):  
Tomohiro Yoshikawa ◽  
Ryosuke Iwakura

Studies on automatic dialogue systems, which allow people and computers to communicate with each other using natural language, have been attracting attention. In particular, the main objective of a non-task-oriented dialogue system is not to achieve a specific task but to amuse users through chat and free dialogue. For this type of dialogue system, continuity of the dialogue is important, because users easily get tired if the dialogue is monotonous. On the other hand, preceding studies have shown that speech with humorous expressions is effective in improving the continuity of a dialogue. In this study, we developed a computer-based humor discriminator to perform user- and situation-independent objective discrimination of humor. Using the humor discriminator, we also developed an automatic humor generation system and conducted an evaluation experiment with human subjects to test the generated jokes. A t-test on the evaluation scores revealed a significant difference (p = 3.5 × 10⁻⁵) between the proposed and existing methods of joke generation.
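
For reference, the reported significance test is an ordinary two-sample t-test over human evaluation scores; a minimal sketch with toy ratings (not the study's data):

```python
# Independent two-sample t-test comparing human ratings of jokes from
# a proposed and an existing generator. The scores are illustrative.
from scipy import stats

proposed = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3]   # toy rating scores
existing = [3.0, 2.8, 3.2, 2.9, 3.1, 2.7]
t, p = stats.ttest_ind(proposed, existing)
print(f"t = {t:.2f}, p = {p:.1e}")  # p below 0.05 -> significant difference
```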


Author(s):  
Khaldoon H. Alhussayni ◽  
Alexander Zamyatin ◽  
S. Eman Alshamery

Dialogue state tracking (DST) plays a critical role in the life cycle of a task-oriented dialogue system. DST represents the user's goals at each step of the dialogue, describing them as a conceptual structure comprising slot-value pairs and dialogue acts that directly determine the performance and effectiveness of the dialogue system. DST faces several challenges: linguistic diversity, dynamic social context, and the distribution of the dialogue state over candidate values, both in the slot values and in the dialogue acts defined in the ontology. In many turns of a dialogue, users refer indirectly to previous utterances, which makes it difficult to identify and use the relevant dialogue history; recent popular methods are ineffective at this. In this paper, we propose a dialogue-history self-attention framework for DST that recognizes the relevant historical context by including previous user utterances, alongside the current user utterance and previous system actions in which specific slot-value pairs vary, and uses these together with a weighted system utterance, outperforming existing models at recognizing the related context and the relevance of a system utterance. The proposed model was evaluated on the WoZ dataset. The implementation was attempted first with the prior user utterance as a dialogue encoder, and second with an additional score combined over all the candidate slot-value pairs in the context of the previous and current user utterances. The proposed model obtained 0.8 percent better results than all state-of-the-art methods in joint goal accuracy, although the gain does not carry over to the turn request metric.
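
A minimal sketch of scoring a candidate slot-value pair with attention over the encoded dialogue history; the architecture, names, and dimensions below are assumptions rather than the paper's exact framework:

```python
# Score a candidate slot-value pair (e.g. "food=thai") against encoded
# dialogue history (previous user utterances, previous system actions,
# current utterance) via multi-head attention.
import torch
import torch.nn as nn

class HistoryAttentionScorer(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, history, candidate):
        # history:   (batch, n_utterances, dim) encoded history turns
        # candidate: (batch, dim) encoded slot-value pair
        query = candidate.unsqueeze(1)               # (b, 1, dim)
        ctx, _ = self.attn(query, history, history)  # attend to history
        return self.score(ctx.squeeze(1))            # relevance logit
```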

