Image captioning is the task of generating a textual description of a digital image. Many models are based on a sequence-to-sequence framework (CNN image encoder + RNN language model) with an attention mechanism.
Previous captioning models usually apply only top-down attention over a uniform grid of CNN features. This model combines bottom-up and top-down attention: an object detector (Faster R-CNN) proposes salient image regions (bottom-up), and a top-down attention module, conditioned on the language model's state, computes weights over the proposed region features. This model won the 2017 VQA Challenge.
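The top-down weighting step can be sketched as follows. This is a minimal NumPy version of soft attention over a set of region features; the parameter names, shapes, and the tanh-MLP scoring function are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def top_down_attention(V, h, W_v, W_h, w_a):
    """Soft top-down attention over bottom-up region features.

    V:   (k, d_v) region features, e.g. pooled Faster R-CNN proposals.
    h:   (d_h,)   language-model hidden state (the top-down signal).
    W_v, W_h, w_a: hypothetical learned parameters of shapes
                   (d_v, d_a), (d_h, d_a), (d_a,).
    Returns the attended feature (d_v,) and the attention weights (k,).
    """
    # One unnormalized score per region, conditioned on h.
    scores = np.tanh(V @ W_v + h @ W_h) @ w_a          # (k,)
    # Softmax over regions (numerically stabilized).
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Weighted sum of region features.
    return alpha @ V, alpha

# Toy usage with random features and parameters.
rng = np.random.default_rng(0)
k, d_v, d_h, d_a = 5, 8, 6, 4
V = rng.standard_normal((k, d_v))
h = rng.standard_normal(d_h)
ctx, alpha = top_down_attention(
    V, h,
    rng.standard_normal((d_v, d_a)),
    rng.standard_normal((d_h, d_a)),
    rng.standard_normal(d_a),
)
```

The attended feature `ctx` is what the RNN would consume at each decoding step; the weights `alpha` show which proposed regions the model is attending to.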