This work combines vision and language to recognize and describe videos with one or more sentences. The video above shows this system in action, captioning videos from the year-1 corpus of the DARPA Mind's Eye program. This research won both evaluations conducted against 11 other teams as part of the Mind's Eye program.

We develop an integrated approach to vision and language that, at its core, scores how well a sentence describes a video. This single video-sentence score supports three different tasks (illustrated by the sketch after the list):

  • generation: describe a video with one or more sentences,
  • focus of attention: in a video where multiple simultaneous events occur, highlight the event that corresponds to a given sentence,
  • retrieval: given a sentence and a video corpus, find the clips that depict that sentence.
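
All three tasks reduce to queries against that one score. Below is a minimal sketch of this reduction; the score function and every other name here are hypothetical placeholders standing in for this system's actual video-sentence scorer, not its real API.

    # Minimal sketch: one video-sentence compatibility score drives all three
    # tasks. score(video, sentence) is a hypothetical stand-in for the real
    # scorer; higher means the sentence better depicts the video.

    from typing import Callable, List

    Score = Callable[[object, str], float]

    def generate(video, sentences: List[str], score: Score, k: int = 3) -> List[str]:
        # Generation: describe a video with the k best-scoring candidate sentences.
        return sorted(sentences, key=lambda s: score(video, s), reverse=True)[:k]

    def focus_of_attention(regions: List[object], sentence: str, score: Score):
        # Focus of attention: among spatiotemporal regions of one video in which
        # multiple events occur, highlight the region depicting the sentence.
        return max(regions, key=lambda r: score(r, sentence))

    def retrieve(corpus: List[object], sentence: str, score: Score, k: int = 10):
        # Retrieval: rank a video corpus by how well each clip depicts the sentence.
        return sorted(corpus, key=lambda v: score(v, sentence), reverse=True)[:k]

Because each task is just a different sort or argmax over the same score, an improvement to the underlying model transfers to all three tasks at once.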

This work progresses in several stages:

  • tracking,
  • event recognition,
  • simultaneous object detection and tracking,
  • simultaneous tracking and event recognition,
  • and simultaneous tracking and sentence recognition (see the sketch after this list).
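
The payoff of the "simultaneous" stages is joint inference: rather than committing to a track first and then classifying the event, a single dynamic program selects the track and the event-state sequence together, so the event model can keep detections that a greedy tracker would discard. The following is a toy Viterbi sketch of that idea, assuming all component scores are in the log domain; the factorization and every function name are illustrative assumptions, not this system's actual implementation.

    # Toy sketch of joint inference over (detection, event-state) hypotheses.
    # Illustrative assumptions throughout: log-domain scores and this exact
    # factorization are not claimed to match the system's implementation.

    from typing import Callable, Dict, List, Sequence, Tuple

    def joint_viterbi(
        detections: Sequence[Sequence[dict]],      # per-frame candidate detections
        n_states: int,                             # number of event-model states
        det_score: Callable[[dict], float],        # log detector confidence
        coherence: Callable[[dict, dict], float],  # log motion coherence across frames
        emission: Callable[[int, dict], float],    # log score of a detection under a state
        transition: Callable[[int, int], float],   # log event-state transition score
    ) -> Tuple[float, List[Tuple[int, int]]]:
        """Best joint score and one (detection index, event state) pair per frame."""
        # delta[(j, q)]: best score of any path ending at detection j in state q.
        delta = {(j, q): det_score(d) + emission(q, d)
                 for j, d in enumerate(detections[0]) for q in range(n_states)}
        back: List[Dict[Tuple[int, int], Tuple[int, int]]] = []
        for t in range(1, len(detections)):
            prev, frame = detections[t - 1], detections[t]
            new_delta, ptr = {}, {}
            for j, d in enumerate(frame):
                for q in range(n_states):
                    # Pick the predecessor that best trades off track coherence
                    # against event-state transition plausibility.
                    (pj, pq), best = max(
                        (((pj, pq), delta[(pj, pq)]
                          + coherence(prev[pj], d) + transition(pq, q))
                         for pj in range(len(prev)) for pq in range(n_states)),
                        key=lambda kv: kv[1])
                    new_delta[(j, q)] = best + det_score(d) + emission(q, d)
                    ptr[(j, q)] = (pj, pq)
            delta = new_delta
            back.append(ptr)
        # Trace back the jointly optimal track and event-state sequence.
        end, best = max(delta.items(), key=lambda kv: kv[1])
        path = [end]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return best, list(reversed(path))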

The source code for this work is on GitHub.

The work described here has appeared in: