Abstract

Visual attention has proven useful in image captioning, where the goal is to enable a captioning model to selectively focus on regions of interest. Existing models typically rely on top-down language information and learn attention implicitly by optimizing the captioning objectives. While somewhat effective, the learned top-down attention can fail to focus on the correct regions of interest without direct attention supervision. Inspired by the human visual system, which is driven not only by task-specific top-down signals but also by visual stimuli, in this work we propose to use both types of attention for image captioning. In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning.

Resources

Shi Chen and Qi Zhao, "Boosted Attention: Leveraging Human Attention for Image Captioning," in ECCV 2018 [pdf] [supplementary] [poster] [bib]

Why Human Attention

Without knowing where to look, top-down model attention (derived from the target task) can fail to focus on objects of interest and instead attend to irrelevant regions. As shown in the figure below, the model focuses on non-salient regions in the background and misses the salient objects in the image, i.e., the bulldog and the teddy bear mentioned in the human-generated caption.

Human stimulus-based attention offers abundant knowledge about where to look, which can complement top-down model attention and help captioning models attend to the correct regions of interest. In the figure below, we see that stimulus-based attention successfully attends to the regions corresponding to the objects of interest mentioned in the human-generated caption.

Top-down attention may fail to focus on objects of interest. (a) the original image with the human-generated caption, (b-c) two top-down attention maps and their corresponding model-generated captions, and (d) the stimulus-based attention map for the image. Words related to the top-down attention maps are colored in red.

Model to Integrate Human Attention

We propose a Boosted Attention method that combines the two types of attention and enables them to complement each other.

A high-level illustration of the proposed Boosted Attention method.
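
To make the integration concrete, below is a minimal sketch of how stimulus-based and top-down attention could be combined at a single decoding step. It assumes precomputed region features, a stimulus-based saliency map, and top-down attention weights produced by the decoder; the log-based feature re-weighting is an illustrative choice, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def boosted_context(features, saliency, alpha_td, eps=1e-6):
    """Illustrative combination of stimulus-based and top-down attention.

    features : (R, D) array, one D-dim feature per image region
    saliency : (R,)   stimulus-based attention (e.g. predicted human saliency), >= 0
    alpha_td : (R,)   top-down attention weights from the caption decoder, sums to 1

    NOTE: hypothetical sketch, not the paper's exact formulation.
    """
    # Boost: re-weight region features by stimulus-based attention so that
    # salient regions are emphasized before the decoder attends over them.
    boosted = features * np.log(1.0 + saliency + eps)[:, None]

    # Top-down attention then pools the boosted features into a context vector.
    context = (alpha_td[:, None] * boosted).sum(axis=0)
    return context
```

In a full captioning model, the resulting context vector would feed the language decoder (e.g., an LSTM) at each word-generation step.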

Cooperation between Human Attention and Top-Down Model Attention

We explore how the two types of attention, human stimulus-based attention and top-down model attention, cooperate with each other during the caption generation process. Our quantitative results show that the two types of attention maps are negatively correlated (correlation coefficient = -0.256, Spearman rank correlation = -0.369), indicating that they cooperate in a complementary manner.
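
As an illustration of how such statistics can be computed (the exact evaluation protocol is described in the paper), the sketch below flattens a pair of attention maps over the same spatial grid and computes Pearson and Spearman correlations with SciPy; the function name and map shapes are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def attention_correlation(stimulus_map, topdown_map):
    """Correlation between a stimulus-based and a top-down attention map.

    Both maps are 2-D arrays over the same spatial grid; they are flattened
    before computing the Pearson and Spearman correlation statistics.
    """
    s = np.asarray(stimulus_map).ravel()
    t = np.asarray(topdown_map).ravel()
    pearson, _ = pearsonr(s, t)
    spearman, _ = spearmanr(s, t)
    return pearson, spearman
```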

Based on the qualitative comparisons (examples shown in the figure below), we summarize three typical scenarios for attention cooperation:

  1. Stimulus-based attention successfully captures all of the objects of interest corresponding to the generated caption (rows 1-2). In this case, top-down attention tends to play a minor role in discriminating the salient regions related to the task.
  2. Stimulus-based attention concentrates on only part of an object without covering it entirely (row 3), or covers some but not all of the objects of interest (row 4). In these situations, top-down attention focuses on the missing regions to enhance the objects of interest and complement stimulus-based attention.
  3. Stimulus-based attention fails to distinguish salient objects from the irrelevant background (row 5). In this case, top-down attention plays a major role in extracting the regions corresponding to the objects of interest.

Qualitative results illustrating that the two types of attention complement each other in various situations. From left to right: original images with generated captions, stimulus-based attention maps, and top-down model attention maps for different words within the captions. The word associated with a specific top-down attention map is highlighted in red.

Experimental Results

Quantitative results on the Flickr30K and MSCOCO datasets with two different baselines (our own baseline and Att2in).

Qualitative results for models with and without the Boosted Attention method. From left to right: original images, stimulus-based attention maps, and captions corresponding to the images. Captions generated by models with and without the Boosted Attention method are colored in red and black, respectively, while the ground-truth human-generated captions are colored in blue.