Image captioning is the task of generating a textual description of a digital image. Many models are based on a sequence-to-sequence framework (CNN image encoder + RNN language model) with an attention mechanism.
Previous captioning models usually apply only top-down attention over a uniform grid of CNN features. This model combines bottom-up and top-down attention: an object detector (Faster R-CNN) proposes salient image regions (bottom-up), and a top-down attention module, conditioned on the language model's state, computes weights over the proposed region features. This model won the 2017 VQA Challenge.
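The top-down weighting step can be sketched as follows. This is a minimal NumPy version of soft attention over a set of region features; the parameter names, shapes, and the tanh-MLP scoring function are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def top_down_attention(V, h, W_v, W_h, w_a):
    """Soft top-down attention over bottom-up region features.

    V:   (k, d_v) region features, e.g. pooled Faster R-CNN proposals.
    h:   (d_h,)   language-model hidden state (the top-down signal).
    W_v, W_h, w_a: hypothetical learned parameters of shapes
                   (d_v, d_a), (d_h, d_a), (d_a,).
    Returns the attended feature (d_v,) and the attention weights (k,).
    """
    # One unnormalized score per region, conditioned on h.
    scores = np.tanh(V @ W_v + h @ W_h) @ w_a          # (k,)
    # Softmax over regions (numerically stabilized).
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Weighted sum of region features.
    return alpha @ V, alpha

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
k, d_v, d_h, d_a = 5, 8, 6, 4
V = rng.standard_normal((k, d_v))
h = rng.standard_normal(d_h)
ctx, alpha = top_down_attention(
    V, h,
    rng.standard_normal((d_v, d_a)),
    rng.standard_normal((d_h, d_a)),
    rng.standard_normal(d_a),
)
```

The attended feature `ctx` is what the RNN would consume at each decoding step; the weights `alpha` show which proposed regions the model is attending to.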