RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer

2021, Vol. 2021, pp. 1-9
Author(s): Xu Zhang, DeZhi Han, Chin-Chen Chang

Visual question answering (VQA) is natural language question answering over visual images. A VQA model must produce answers to specific questions based on its understanding of an image; most importantly, it must understand the relationship between images and language. This paper therefore proposes a new model, the Representation of Dense Multimodality Fusion Encoder Based on Transformer (RDMMFET for short), which can learn the knowledge relating vision and language. The RDMMFET model consists of three parts: a dense language encoder, an image encoder, and a multimodality fusion encoder. In addition, we designed three types of pretraining tasks: a masked language model, a masked image model, and a multimodality fusion task. These pretraining tasks help the model learn fine-grained alignments between text and image regions. Simulation results on the VQA v2.0 data set show that the RDMMFET model outperforms previous models. Finally, we conducted detailed ablation studies on the RDMMFET model and provide attention visualizations, which show that the RDMMFET model significantly improves VQA performance.
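As a rough illustration of the kind of multimodality fusion this abstract describes, the sketch below implements single-head cross-attention in which question tokens attend over image-region features. All names, dimensions, and random features here are hypothetical; this is a generic transformer-style fusion step, not the RDMMFET paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, image):
    """Single-head cross-attention: text tokens attend over image regions.

    text:  (n_tokens, d) language-encoder outputs
    image: (n_regions, d) image-encoder outputs
    Returns fused text representations of shape (n_tokens, d).
    """
    d = text.shape[-1]
    scores = text @ image.T / np.sqrt(d)   # (n_tokens, n_regions)
    weights = softmax(scores, axis=-1)     # attention over regions, rows sum to 1
    return weights @ image                 # region-weighted fusion

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))    # 5 question tokens
image = rng.normal(size=(36, 16))  # 36 detected image regions
fused = cross_attention(text, image)
print(fused.shape)  # (5, 16)
```

In a full model, such a layer would sit inside the multimodality fusion encoder, stacked with self-attention and feed-forward sublayers.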

AI Magazine, 2014, Vol. 35 (1), pp. 38
Author(s): Ulli Waltinger, Dan Tecuci, Mihaela Olteanu, Vlad Mocanu, Sean Sullivan

This paper describes USI Answers, a natural language question-answering system for enterprise data. We report progress toward the goal of offering easy access to enterprise data to a large number of business users, most of whom are not familiar with the specific syntax or semantics of the underlying data sources. Additional complications come from the nature of the data, which is both structured and unstructured. The proposed solution allows users to express questions in natural language, makes the system's interpretation of the query apparent, and allows easy query adjustment and reformulation. The application is in use by more than 1,500 users at Siemens Energy. We evaluate our approach on a data set consisting of fleet data.


Author(s): Xinmeng Li, Mamoun Alazab, Qian Li, Keping Yu, Quanjun Yin

Knowledge graph question answering is an important technology in intelligent human-robot interaction; it aims to automatically answer a human's natural language question over a given knowledge graph. For multi-relation questions, with their higher variety and complexity, the tokens of the question carry different priorities for triple selection at each reasoning step. Most existing models take the question as a whole and ignore this priority information. To solve this problem, we propose a question-aware memory network for multi-hop question answering, named QA2MN, which updates the attention on the question dynamically during the reasoning process. In addition, we incorporate graph context information into the knowledge graph embedding model to increase its ability to represent entities and relations; we use it to initialize QA2MN and fine-tune it during training. We evaluate QA2MN on PathQuestion and WorldCup2014, two representative datasets for complex multi-hop question answering. The results demonstrate that QA2MN achieves state-of-the-art Hits@1 accuracy on both datasets, which validates the effectiveness of our model.
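The hop-wise, question-aware reasoning loop described above can be caricatured as follows. This is an illustrative sketch with invented names and dimensions, not the authors' QA2MN implementation: at each hop, token-level attention re-weights the question before it is matched against a memory of triple embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_scores(question_tokens, triples, n_hops=2):
    """Toy question-aware multi-hop reasoning over a triple memory.

    question_tokens: (n_q, d) token embeddings of the question
    triples:         (n_t, d) embeddings of candidate KG triples
    Returns a score per triple after n_hops reasoning steps.
    """
    control = question_tokens.mean(axis=0)            # initial question summary
    for _ in range(n_hops):
        tok_att = softmax(question_tokens @ control)  # which tokens matter now
        control = tok_att @ question_tokens           # updated question focus
        mem_att = softmax(triples @ control)          # attention over triples
        control = control + mem_att @ triples         # read from memory
    return triples @ control                          # final triple scores

rng = np.random.default_rng(1)
q = rng.normal(size=(6, 8))    # 6 question tokens
t = rng.normal(size=(10, 8))   # 10 candidate triples
scores = multi_hop_scores(q, t)
best = int(scores.argmax())    # index of the highest-scoring triple
```

The point of the token-attention step is that different question words dominate at different hops, which is exactly the priority information the abstract says whole-question models discard.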


Author(s): Lianli Gao, Pengpeng Zeng, Jingkuan Song, Yuan-Fang Li, Wu Liu, ...

To date, visual question answering (VQA), covering both image QA and video QA, remains a holy grail of vision and language understanding, especially video QA. While image QA focuses primarily on understanding the associations between image region-level details and the corresponding questions, video QA requires a model to reason jointly across both the spatial and the long-range temporal structure of a video, as well as the text, to provide an accurate answer. In this paper, we tackle the problem of video QA by proposing a Structured Two-stream Attention network, STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structure in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of the query- and video-aware context representations and infers the answer. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, and 11.0% on the Action, Trans., and FrameQA tasks and by 0.3 on the Count task. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
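A minimal sketch of the two-stream idea, under invented names and shapes (this is not the STA paper's code): one stream attends over temporal video segments conditioned on the question, the other attends over question words conditioned on the video, and the two contexts are fused.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(segments, words):
    """Toy two-stream attention for video QA.

    segments: (n_seg, d) features of temporal video segments
    words:    (n_w, d)  question word features
    Returns a fused context vector of shape (2 * d,).
    """
    q = words.mean(axis=0)                 # question summary
    v = segments.mean(axis=0)              # video summary
    seg_att = softmax(segments @ q)        # question-guided visual attention
    word_att = softmax(words @ v)          # video-guided text attention
    visual_ctx = seg_att @ segments        # weighted video context
    text_ctx = word_att @ words            # weighted text context
    return np.concatenate([visual_ctx, text_ctx])

rng = np.random.default_rng(2)
ctx = two_stream_attention(rng.normal(size=(8, 12)),  # 8 video segments
                           rng.normal(size=(5, 12)))  # 5 question words
print(ctx.shape)  # (24,)
```

In the full model, a classifier over such a fused context would produce the answer; the segment features would come from the structured segment component rather than random vectors.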


2019, Vol. 29 (11n12), pp. 1801-1818
Author(s): Yixiao Yang, Xiang Chen, Jiaguang Sun

In the last few years, applying language models to source code has been the state-of-the-art method for code completion. However, compared with natural language, code exhibits much more obvious repetition. For example, a variable may be used many times in the code that follows its declaration; variables in source code have a high chance of being repeated. Cloned code and templates also have this token-repetition property, so capturing the token repetition of source code is important. In different projects, variables and types are usually named differently, which means that a model trained on a finite data set will encounter many unseen variables and types in another data set. How to model the semantics of unseen data, and how to predict unseen tokens from patterns of token repetition, are two challenges in code completion. Hence, in this paper token repetition is modelled as a graph, and we propose a novel REP model based on a deep graph neural network to learn code token repetition. The REP model identifies the edge connections of the graph in order to recognize token repetition. To predict the repetition of token [Formula: see text], the information of all previous tokens must be considered. We use a memory neural network (MNN) to model the semantics of each distinct token, making the REP framework more targeted. The experiments indicate that the REP model performs better than an LSTM model. Comparing against the Attention-Pointer network, we also discover that the attention mechanism does not work in all situations. The proposed REP model achieves similar or slightly better prediction accuracy than the Attention-Pointer network while consuming less training time. We also identify another attention mechanism that could further improve prediction accuracy.
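The "token repetition as a graph" idea can be illustrated concretely. The sketch below (an invented helper, not the paper's REP implementation) links each token occurrence back to the most recent earlier occurrence of the same token, producing the kind of repetition edges a graph model would learn to predict.

```python
def repetition_edges(tokens):
    """Build repetition edges: each reuse of a token is connected to its
    most recent previous occurrence. Illustrative sketch only."""
    last_seen = {}   # token text -> index of its latest occurrence
    edges = []
    for i, tok in enumerate(tokens):
        if tok in last_seen:
            edges.append((last_seen[tok], i))  # edge: previous use -> reuse
        last_seen[tok] = i
    return edges

code = ["x", "=", "f", "(", "x", ")", ";", "x"]
print(repetition_edges(code))  # [(0, 4), (4, 7)]
```

Note how the edges are independent of the variable's name: a model that predicts "the next token repeats position 4" generalizes to projects whose identifiers it has never seen, which is exactly the unseen-token challenge the abstract raises.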


2007, Vol. 13 (2), pp. 185-189
Author(s): Robert Dale

“Powerset Hype to Boiling Point”, said a February headline on TechCrunch. In the last installment of this column, I asked whether 2007 would be the year of question answering. My query was occasioned by a number of new attempts at natural language question answering that were being promoted in the marketplace as the next advance upon search, and particularly by the buzz around the stealth-mode natural language search company Powerset. That buzz continued with a major news item in the first quarter of this year: in February, Xerox PARC and Powerset struck a much-anticipated deal whereby Powerset won exclusive rights to use PARC's natural language technology, as announced in a VentureBeat posting. Following the scoop, other news sources drew the battle lines with titles like “Can natural language search bring down Google?”, “Xerox vs. Google?”, and “Powerset and Xerox PARC team up to beat Google”. An April posting on Barron's Online noted that an analyst at Global Equities Research had cited Powerset in his downgrading of Google from Buy to Neutral. And all this on the basis of a product which, at the time of writing, very few people have actually seen. Indications are that the search engine is expected to go live by the end of the year, so we have a few more months to wait to see whether this really is a Google-killer. Meanwhile, another question that remains unanswered is what happened to the Powerset engineer who seemed less sure about the technology's capabilities: see the segment at the end of D7TV's PartyCrasher video from the Powerset launch party. For a more confident appraisal of natural language search, check out the podcast of Barney Pell, CEO of Powerset, giving a lecture at the University of California–Berkeley.


2010, Vol. 23 (2-3), pp. 241-265
Author(s): Ulrich Furbach, Ingo Glöckner, Björn Pelzer

Entropy, 2020, Vol. 22 (5), pp. 533
Author(s): Qin Zhao, Chenguang Hou, Changjian Liu, Peng Zhang, Ruifeng Xu

Quantum-inspired language models have been introduced to information retrieval for their transparency and interpretability. While exciting progress has been made, current studies mainly investigate the relationships between density matrices of different sentence subspaces of a semantic Hilbert space; the Hilbert space as a whole, which has a unique density matrix, remains underexplored. In this paper, we propose a novel Quantum Expectation Value based Language Model (QEV-LM). A single shared density matrix is constructed for the semantic Hilbert space, and words and sentences are viewed as different observables in this quantum model. Under this framework, the matching score describing the similarity between a question-answer pair is naturally explained as the quantum expectation value of a joint question-answer observable. In addition to its theoretical soundness, experimental results on the TREC-QA and WIKIQA datasets demonstrate the computational efficiency of our model, with excellent performance and low time consumption.
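The core quantity here is the quantum expectation value Tr(ρO) of an observable O under a density matrix ρ. The sketch below builds a toy shared density matrix from unit word vectors and scores a question-answer pair against it; the construction of the joint observable is an invented simplification for illustration, not the QEV-LM paper's formulation.

```python
import numpy as np

def density_matrix(vectors):
    """Shared density matrix: a uniform mixture of rank-one projectors
    built from unit-normalized word vectors (illustrative sketch).
    The result is Hermitian, positive semidefinite, with trace 1."""
    vs = [v / np.linalg.norm(v) for v in vectors]
    return sum(np.outer(v, v) for v in vs) / len(vs)

def expectation(rho, observable):
    """Quantum expectation value <O> = Tr(rho @ O)."""
    return np.trace(rho @ observable)

rng = np.random.default_rng(3)
words = rng.normal(size=(20, 6))      # 20 toy word vectors, dimension 6
rho = density_matrix(words)

# A toy joint question-answer observable: the projector onto the
# normalized sum of a question vector and an answer vector.
q, a = rng.normal(size=6), rng.normal(size=6)
qa = q + a
qa /= np.linalg.norm(qa)
score = expectation(rho, np.outer(qa, qa))  # matching score in [0, 1]
print(round(np.trace(rho), 6))  # 1.0
```

Because ρ is a trace-one mixture of projectors and the observable here is itself a projector, the resulting score is guaranteed to lie in [0, 1], which is what makes it usable directly as a matching score.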

