Visual Question Answering (VQA) is the task of generating an answer to a natural language question about the contents of an image. VQA models are typically trained and evaluated on datasets such as VQA 2.0, GQA, Visual7W, and VizWiz.
This is a modular re-implementation of the bottom-up top-down (up-down) model (Anderson et al.), with subtle but important changes: a modified model architecture and learning rate schedule, fine-tuned image features, and added data augmentation. This model was the winning entry to the 2018 VQA Challenge.
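To give a feel for the core mechanism, here is a rough NumPy sketch of the top-down attention step at the heart of the up-down model: image-region features produced by a bottom-up detector are scored against the encoded question, and a softmax over those scores yields a single attended image vector. The single-layer elementwise-product scorer below is a hypothetical simplification; the actual model uses learned projections and gated nonlinearities.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def top_down_attention(region_feats, question_emb, w):
    # region_feats: (k, d) bottom-up features, one row per detected region
    # question_emb: (d,) encoded question vector
    # w: (d,) weights of a hypothetical single-layer scorer
    fused = region_feats * question_emb   # elementwise fusion, shape (k, d)
    scores = fused @ w                    # one scalar relevance score per region
    alpha = softmax(scores)               # attention weights summing to 1
    v_hat = alpha @ region_feats          # attended image feature, shape (d,)
    return alpha, v_hat

# Toy usage: 36 regions with 512-dim features, as in common up-down setups.
rng = np.random.default_rng(0)
k, d = 36, 512
feats = rng.standard_normal((k, d))
q = rng.standard_normal(d)
w = rng.standard_normal(d)
alpha, v_hat = top_down_attention(feats, q, w)
```

The attended vector `v_hat` would then be fused with the question embedding and passed to an answer classifier over the candidate-answer vocabulary.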