This task is to produce concise summaries of situations in images depicting: (1) the main activity, (2) the participating actors, objects, substances, and locations and most importantly (3) the roles these participants play in the activity.
This model uses a conditional random field acting on potentials calculated from a neural net. These potentials contain information about the nouns or potential objects in the image, the role these objects have in the scene and a representation of the entire image. The representation for the roles are specific for the image while the representations for the nouns are global. The action for an image is calculated directly from the image representation.