Bi-Directional Co-Attention Network for Image Captioning

Weitao Jiang; Weixuan Wang; Haifeng Hu

doi:10.1145/3460474

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3460474 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-20

Author(s):

Weitao Jiang ◽

Weixuan Wang ◽

Haifeng Hu

Keyword(s):

A Priori ◽

Attention Mechanism ◽

Superior Performance ◽

Significant Advance ◽

Visual Features ◽

Image Captioning ◽

Top Down ◽

Bottom Up ◽

Attention Network ◽

Benchmark Datasets

Image captioning, which automatically describes an image in natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advances have been made in image captioning by improving attention mechanisms. However, most existing methods build their attention mechanisms on a single type of visual feature, such as patch features or object features, which limits the accuracy of the generated captions. In this article, we propose a Bidirectional Co-Attention Network (BCAN) that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and there are a priori relations between these multiple visual features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods do not exploit an effective multimodal integration strategy, generally using addition or concatenation to combine features. To solve this problem, we adopt the Multivariate Residual Module (MRM) to integrate multimodal attention features. We further propose a Vertical MRM to integrate features of the same category and a Horizontal MRM to combine features of different categories, which balances the contributions of the bottom-up and top-down co-attention. In contrast to existing methods, the BCAN obtains complementary information from multiple visual features via the bi-directional co-attention strategy and integrates multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves superior performance.
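The idea of attending over two kinds of visual features in both directions can be illustrated with a small sketch. This is not the BCAN architecture: the abstract does not give its equations, so the dot-product scoring, the way one feature set guides attention over the other, and the plain averaging used for fusion (a stand-in for the MRM, which is not reproduced here) are all assumptions, and every function name is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Dot-product attention: weight each feature vector by similarity to the query."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def guided_attend(query, guide_features, target_features):
    """Attend over guide_features, then let that context steer attention over target_features.

    Adding the guide context to the query is an assumed, simplified form of guidance.
    """
    guide_context, _ = attend(query, guide_features)
    guided_query = [q + g for q, g in zip(query, guide_context)]
    return attend(guided_query, target_features)


def bidirectional_co_attention(query, patch_feats, object_feats):
    """Run guided attention in both directions and fuse the two contexts.

    Fusion here is a plain average, used only as a placeholder for a learned module.
    """
    object_context, object_weights = guided_attend(query, patch_feats, object_feats)
    patch_context, patch_weights = guided_attend(query, object_feats, patch_feats)
    fused = [(a + b) / 2 for a, b in zip(object_context, patch_context)]
    return fused, object_weights, patch_weights


# Toy inputs (hypothetical values): a decoder state and two small feature sets.
query = [1.0, 0.0]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
object_feats = [[2.0, 0.0], [0.0, 2.0]]
fused, object_weights, patch_weights = bidirectional_co_attention(
    query, patch_feats, object_feats
)
```

Each direction yields its own attention distribution, so the two feature types can contribute different information to the fused context for the same decoding step.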

An historical framework for psychiatric nosology

Psychological Medicine ◽

10.1017/s0033291709005753 ◽

2009 ◽

Vol 39 (12) ◽

pp. 1935-1941 ◽

Cited By ~ 86

Author(s):

K. S. Kendler

Keyword(s):

Psychiatric Illness ◽

Theoretical Orientation ◽

A Priori ◽

Early History ◽

Top Down ◽

Species Definition ◽

Psychiatric Nosology ◽

Bottom Up ◽

Starting Point ◽

History Of

This essay, which seeks to provide an historical framework for our efforts to develop a scientific psychiatric nosology, begins by reviewing the classificatory approaches that arose in the early history of biological taxonomy. Initial attempts at species definition used top-down approaches advocated by experts and based on a few essential features of the organism chosen a priori. This approach was subsequently rejected on both conceptual and practical grounds and replaced by bottom-up approaches making use of a much wider array of features. Multiple parallels exist between the beginnings of biological taxonomy and psychiatric nosology. Like biological taxonomy, psychiatric nosology largely began with ‘expert’ classifications, typically influenced by a few essential features, articulated by one or more great 19th-century diagnosticians. Like biology, psychiatry is struggling toward more soundly based bottom-up approaches using diverse illness characteristics. The underemphasized historically contingent nature of our current psychiatric classification is illustrated by recounting the history of how ‘Schneiderian’ symptoms of schizophrenia entered into DSM-III. Given these historical contingencies, it is vital that our psychiatric nosologic enterprise be cumulative. This can be best achieved through a process of epistemic iteration. If we can develop a stable consensus in our theoretical orientation toward psychiatric illness, we can apply this approach, which has one crucial virtue. Regardless of the starting point, if each iteration (or revision) improves the performance of the nosology, the eventual success of the nosologic process, to optimally reflect the complex reality of psychiatric illness, is assured.

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition ◽

10.1109/cvpr.2018.00636 ◽

2018 ◽

Cited By ~ 512

Author(s):

Peter Anderson ◽

Xiaodong He ◽

Chris Buehler ◽

Damien Teney ◽

Mark Johnson ◽

...

Keyword(s):

Question Answering ◽

Image Captioning ◽

Top Down ◽

Bottom Up ◽

Visual Question Answering

Mapping seabed assemblages using comparative top-down and bottom-up classification approaches

Canadian Journal of Fisheries and Aquatic Sciences ◽

10.1139/f06-058 ◽

2006 ◽

Vol 63 (7) ◽

pp. 1536-1548 ◽

Cited By ~ 28

Author(s):

Paul D Eastwood ◽

Sami Souissi ◽

Stuart I Rogers ◽

Roger A Coggan ◽

Craig J Brown

Keyword(s):

Low Cost ◽

A Priori ◽

Ground Truth ◽

Unsupervised Classification ◽

Physical Structure ◽

Unconsolidated Sediments ◽

Top Down ◽

Bottom Up ◽

Infaunal Community ◽

Acoustic Technologies

Acoustic technologies yield many benefits for mapping the physical structure of seabed environments but are not ideally suited to classifying associated biological assemblages. We tested this assumption using benthic infauna data collected off the south coast of England by applying top-down (supervised) and bottom-up (unsupervised) classification approaches. The top-down approach was based on an a priori acoustic classification of the seabed followed by characterization of the acoustic regions using ground-truth biological samples. By contrast, measures of similarity between the ground-truth infaunal community data formed the basis of the bottom-up approach to assemblage classification. For both approaches, individual assemblages were mapped by first computing Bayesian conditional probabilities for ground-truth stations to estimate the probability of each station belonging to an assemblage. Assemblage distributions were then interpolated over a regular grid and characterized using an indicator value index. While the two methods of classification yielded assemblages and output maps that were broadly comparable, the bottom-up approach arrived at a slightly better defined set of biological assemblages. This suggests that acoustically derived seabed data are not ideally suited to classifying biological assemblages over unconsolidated sediments, despite offering considerable advantages in providing rapid and low-cost assessments of seabed physical structure.

Boosting bottom-up and top-down visual features for saliency estimation

2012 IEEE Conference on Computer Vision and Pattern Recognition ◽

10.1109/cvpr.2012.6247706 ◽

2012 ◽

Cited By ~ 138

Author(s):

A. Borji

Keyword(s):

Visual Features ◽

Top Down ◽

Bottom Up

What activates the human mirror neuron system during observation of artificial movements: Bottom-up visual features or top-down intentions?

Neuropsychologia ◽

10.1016/j.neuropsychologia.2008.01.025 ◽

2008 ◽

Vol 46 (7) ◽

pp. 2033-2042 ◽

Cited By ~ 21

Author(s):

Annerose Engel ◽

Michael Burke ◽

Katja Fiehler ◽

Siegfried Bien ◽

Frank Rösler

Keyword(s):

Mirror Neuron System ◽

Mirror Neuron ◽

Visual Features ◽

Top Down ◽

Bottom Up ◽

Neuron System

A Bottom-Up and Top-Down Approach for Image Captioning using Transformer

Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing ◽

10.1145/3293353.3293391 ◽

2018 ◽

Author(s):

Sandeep Narayan Parameswaran ◽

Sukhendu Das

Keyword(s):

Image Captioning ◽

Top Down ◽

Bottom Up

Inverse modelling of European CH4 emissions during 2006–2012 using different inverse models and reassessed atmospheric observations

10.5194/acp-2017-273 ◽

2017 ◽

Cited By ~ 2

Author(s):

Peter Bergamaschi ◽

Ute Karstens ◽

Alistair J. Manning ◽

Marielle Saunois ◽

Aki Tsuruta ◽

...

Keyword(s):

A Priori ◽

Lower Troposphere ◽

Inverse Modelling ◽

Top Down ◽

Data Set ◽

Bottom Up ◽

CH4 Emissions ◽

In Situ Data ◽

Inverse Models ◽

The Impact

Abstract. We present inverse modelling (top-down) estimates of European methane (CH4) emissions for 2006–2012 based on a new quality-controlled and harmonized in-situ data set from 18 European atmospheric monitoring stations. We applied an ensemble of seven inverse models and performed four inversion experiments, investigating the impact of different sets of stations and the use of a priori information on emissions. The inverse models infer total CH4 emissions of 26.7 (20.2–29.7) Tg CH4 yr−1 (mean, 10th and 90th percentiles from all inversions) for the EU-28 for 2006–2012 from the four inversion experiments. For comparison, total anthropogenic CH4 emissions reported to UNFCCC (bottom-up, based on statistical data and emissions factors) amount to only 21.3 Tg CH4 yr−1 (2006) to 18.8 Tg CH4 yr−1 (2012). A potential explanation for the higher range of top-down estimates compared to bottom-up inventories could be the contribution from natural sources, such as peatlands, wetlands, and wet soils. Based on seven different wetland inventories from the Wetland and Wetland CH4 Inter-comparison of Models Project (WETCHIMP), total wetland emissions of 4.3 (2.3–8.2) Tg CH4 yr−1 from the EU-28 are estimated. The hypothesis of significant natural emissions is supported by the finding that several inverse models yield significant seasonal cycles of derived CH4 emissions with maxima in summer, while anthropogenic CH4 emissions are assumed to have much lower seasonal variability. Furthermore, we investigate potential biases in the inverse models by comparison with regular aircraft profiles at four European sites and with vertical profiles obtained during the Infrastructure for Measurement of the European Carbon Cycle (IMECC) aircraft campaign. We present a novel approach to estimate the biases in the derived emissions, based on the comparison of simulated and measured enhancements of CH4 compared to the background, integrated over the entire boundary layer and over the lower troposphere. This analysis identifies regional biases for several models at the aircraft profile sites in France, Hungary and Poland.

Estimating European volatile organic compound emissions using satellite observations of formaldehyde from the Ozone Monitoring Instrument

Atmospheric Chemistry and Physics Discussions ◽

10.5194/acpd-10-19697-2010 ◽

2010 ◽

Vol 10 (8) ◽

pp. 19697-19736 ◽

Cited By ~ 1

Author(s):

G. Curci ◽

P. I. Palmer ◽

T. P. Kurosu ◽

K. Chance ◽

G. Visconti

Keyword(s):

A Priori ◽

Satellite Observations ◽

Ozone Monitoring Instrument ◽

Top Down ◽

Bottom Up ◽

VOC Emissions ◽

The Balkans ◽

Monitoring Instrument ◽

Volatile Organic ◽

Ozone Monitoring

Abstract. Emission of non-methane Volatile Organic Compounds (VOCs) to the atmosphere stems from biogenic and human activities, and its estimation is difficult because of the many and not fully understood processes involved. In order to narrow down the uncertainty related to VOC emissions, which negatively reflects on our ability to simulate the atmospheric composition, we exploit satellite observations of formaldehyde (HCHO), a ubiquitous oxidation product of most VOCs, focusing on Europe. HCHO column observations from the Ozone Monitoring Instrument (OMI) reveal a marked seasonal cycle with a summer maximum and winter minimum. In summer, the oxidation of methane and other long-lived VOCs supplies a slowly varying background HCHO column, while HCHO variability is dominated by the most reactive VOCs, primarily biogenic isoprene followed in importance by biogenic terpenes and anthropogenic VOCs. The chemistry-transport model CHIMERE qualitatively reproduces the temporal and spatial features of the observed HCHO column, but displays regional biases which are attributed mainly to incorrect biogenic VOC emissions, calculated with the Model of Emissions of Gases and Aerosol from Nature (MEGAN) algorithm. These "bottom-up" or a-priori emissions are corrected through a Bayesian inversion of the OMI HCHO observations. Resulting "top-down" or a-posteriori isoprene emissions are lower than "bottom-up" by 40% over the Balkans and by 20% over Southern Germany, and higher by 20% over the Iberian Peninsula, Greece and Italy. The inversion is shown to be robust against assumptions on the a-priori and the inversion parameters. We conclude that OMI satellite observations of HCHO can provide a quantitative "top-down" constraint on the European "bottom-up" VOC inventories.
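How a Bayesian inversion turns an a-priori ("bottom-up") emission estimate into an a-posteriori ("top-down") one can be shown with a one-variable sketch. It assumes a linear observation model with Gaussian errors, which is the textbook form and not necessarily the setup used in the study; the function name, the sensitivity `k`, and all numbers are hypothetical and carry no physical meaning.

```python
def bayesian_update(prior, prior_var, obs, obs_var, k=1.0):
    """Scalar linear-Gaussian Bayesian update.

    Assumed observation model: obs = k * emission + noise,
    with noise variance obs_var and prior variance prior_var.
    """
    # Gain: how strongly the observation is allowed to pull the estimate.
    gain = prior_var * k / (k * k * prior_var + obs_var)
    # A-posteriori estimate: prior corrected by the weighted model-observation mismatch.
    posterior = prior + gain * (obs - k * prior)
    # A-posteriori variance is never larger than the prior variance.
    posterior_var = (1.0 - gain * k) * prior_var
    return posterior, posterior_var


# Illustrative numbers only: prior and observation with equal uncertainty.
posterior, posterior_var = bayesian_update(
    prior=10.0, prior_var=4.0, obs=6.0, obs_var=4.0, k=1.0
)
print(posterior, posterior_var)  # 8.0 2.0
```

With equal prior and observation variances the a-posteriori value lands halfway between the two, and the variance is halved. A less certain prior moves the result closer to the observation.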

Salient Object Detection via Bottom-up and Top-down Visual Features

International Journal of Advancements in Computing Technology ◽

10.4156/ijact.vol5.issue6.92 ◽

2013 ◽

Vol 5 (6) ◽

pp. 785-793

Author(s):

Chao Jia ◽

Lin Yang ◽

Fang Hou ◽

Liangliang Duan

Keyword(s):

Object Detection ◽

Salient Object Detection ◽

Visual Features ◽

Salient Object ◽

Top Down ◽

Bottom Up

A statistical mixture method to reveal bottom-up and top-down factors guiding the eye-movements

Journal of Eye Movement Research ◽

10.16910/jemr.3.2.5 ◽

2010 ◽

Vol 3 (2) ◽

Author(s):

Thomas Couronné ◽

Anne Guérin-Dugué ◽

Michel Dubois ◽

Pauline Faye ◽

Christian Marendaz

Keyword(s):

Eye Movements ◽

Visual Attention ◽

A Priori ◽

Affective State ◽

Cognitive Activity ◽

Spatial Density ◽

Top Down ◽

Bottom Up ◽

Eye Fixations ◽

Proposed Model

When people gaze at real scenes, their visual attention is driven both by a set of bottom-up processes arising from the signal properties of the scene and by top-down effects such as the task, the affective state, prior knowledge, or the semantic context. The context of this study is an assessment of manufactured objects (here, car cab interiors). Within this dedicated context, this work describes a set of methods to analyze eye movements during visual scene evaluation, although these methods can be adapted to more general contexts. We define a statistical model to explain the eye fixations measured experimentally by eye-tracking, even when the signal-to-noise ratio is poor or raw data are lacking. One of the novelties of the approach is to use complementary experimental data obtained with the “Bubbles” paradigm. The proposed model is an additive mixture of several a priori spatial density distributions of factors guiding visual attention. The “Bubbles” paradigm is adapted here to reveal the semantic density distribution, which represents the cumulative effects of the top-down factors. Then, the contribution of each factor is compared depending on the product and on the task, in order to highlight the properties of visual attention and cognitive activity in each situation.
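An additive mixture of fixed spatial density maps can be sketched as follows. The abstract does not say how the mixture weights are estimated; expectation-maximisation (EM) over the weights is used here as one common choice, which is an assumption. The grid, the two density maps, and the fixation list are hypothetical toy values.

```python
def mixture_density(weights, densities, cell):
    """Additive mixture: weighted sum of the component densities at one grid cell."""
    return sum(w * d[cell] for w, d in zip(weights, densities))


def em_weights(densities, fixations, n_iter=50):
    """Estimate mixture weights by EM while keeping the component densities fixed.

    densities: list of density maps, each a list over grid cells that sums to 1.
    fixations: list of grid-cell indices where fixations were observed.
    """
    k = len(densities)
    weights = [1.0 / k] * k  # start from equal contributions
    for _ in range(n_iter):
        totals = [0.0] * k
        for cell in fixations:
            # E-step: responsibility of each factor for this fixation.
            joint = [w * d[cell] for w, d in zip(weights, densities)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm
        # M-step: new weight is the average responsibility.
        weights = [t / len(fixations) for t in totals]
    return weights


# Toy 4-cell scene: a uniform map and a map peaked on cell 0.
uniform_map = [0.25, 0.25, 0.25, 0.25]
peaked_map = [0.7, 0.1, 0.1, 0.1]
fixations = [0, 0, 0, 1, 2, 0, 3, 0]

weights = em_weights([uniform_map, peaked_map], fixations)
density_at_peak = mixture_density(weights, [uniform_map, peaked_map], 0)
```

The estimated weights can then be compared across products or tasks to see how much each factor contributes to the observed fixations.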
