A general-purpose vision system (GPV) can solve many different vision and vision+language tasks with a unified architecture. The model is trained to respond to an image paired with a natural language task description, and the description determines which task is performed. For example, the question “What is this?” prompts the model to perform image classification. Some systems can also take a bounding box as input to perform additional tasks, such as classifying a particular region of the image. General-purpose vision systems can learn new tasks quickly and transfer knowledge between tasks.
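The input format described above can be sketched as a small data structure. This is a minimal illustration, not the interface of any particular GPV: the names `GPVRequest`, `EXAMPLE_PROMPTS`, and `make_request` are hypothetical, the box layout is an assumption, and every prompt other than “What is this?” is an invented example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed box layout for illustration: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GPVRequest:
    """One query to a general-purpose vision system (hypothetical type)."""
    image_path: str
    prompt: str                 # natural language task description
    box: Optional[Box] = None   # optional region of interest


# Only the classification prompt comes from the text above;
# the others are illustrative assumptions.
EXAMPLE_PROMPTS = {
    "classification": "What is this?",
    "localization": "Find the dog.",
    "visual question answering": "What color is the car?",
    "captioning": "Describe this image.",
    "classification-in-context": "What is this object?",
}


def make_request(image_path: str, task: str, box: Optional[Box] = None) -> GPVRequest:
    """Build a request for a named task, checking the inputs it needs."""
    if task not in EXAMPLE_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    # Classifying a region only makes sense when a region is supplied.
    if task == "classification-in-context" and box is None:
        raise ValueError("classification-in-context requires a bounding box")
    return GPVRequest(image_path=image_path, prompt=EXAMPLE_PROMPTS[task], box=box)
```

In this sketch the same request type serves every task: only the prompt text, and optionally the box, changes between them, which mirrors the idea of a unified architecture driven by a task description.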
A GPV that uses the VinVL object detector and the T5 language model. It is trained on data from the MS COCO dataset for five tasks: classification, localization, visual question answering, captioning, and classification-in-context. GPV 2 is also trained on web-search data covering over 10,000 visual concepts, so it understands a larger range of objects and actions than appears in MS COCO images.
A GPV that uses the DETR object detector and can be trained fully end-to-end. It is trained on MS COCO data for four tasks: classification, localization, visual question answering, and captioning. This model does not accept a bounding box as input.