This work combines vision and language to recognize and describe videos with one or more sentences. The video above shows the system in action, captioning videos from the year-1 corpus of the DARPA Mind's Eye program. This research won both evaluations conducted against 11 other teams as part of that program.

We develop an integrated approach to vision and language that, in essence, scores a video-sentence pair; this single scoring function can then be used to perform three different tasks:
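To make the idea concrete, here is a minimal sketch of how one scoring function over video-sentence pairs can drive multiple tasks such as description generation and video retrieval. Everything below is hypothetical: the toy `score` function, the annotation format, and the task names are illustrative assumptions, not the actual system, which computes its score from detected and tracked objects together with the sentence's semantics.

```python
# Hypothetical sketch: one video-sentence scorer, several tasks.

def score(video, sentence):
    # Toy stand-in scorer: count the sentence's words that appear in the
    # video's (hypothetical) label set. The real system instead evaluates
    # the sentence's meaning against object tracks in the video.
    return sum(word in video["labels"] for word in sentence.split())

def describe(video, candidate_sentences):
    # Generation: pick the best-scoring sentence for a given video.
    return max(candidate_sentences, key=lambda s: score(video, s))

def retrieve(sentence, videos):
    # Retrieval: pick the best-scoring video for a given sentence.
    return max(videos, key=lambda v: score(v, sentence))

videos = [
    {"name": "clip1", "labels": {"person", "picked", "up", "a", "box"}},
    {"name": "clip2", "labels": {"person", "approached", "a", "car"}},
]
sentences = ["The person picked up a box", "The person approached a car"]

print(describe(videos[0], sentences))                    # generation
print(retrieve("The person approached a car", videos)["name"])  # retrieval
```

The point of the sketch is the shape of the interface: because both tasks reduce to maximizing the same pairwise score, improving the scorer improves every task at once.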

This work progresses in several stages:

The source code for this work is on GitHub.

Work described here has appeared in: