This work combines vision and language to recognize and describe videos with one or more sentences. The video above shows this system in action, captioning videos from the year 1 corpus of the DARPA Mind's Eye program. This research won both evaluations conducted against 11 other teams as part of the Mind's Eye program.
We develop an integrated approach to vision and language that, in essence, scores a video-sentence pair and uses that score to perform three different tasks:
- generation: describe a video with one or more sentences,
- focus of attention: in a video where multiple events occur simultaneously, highlight the event that corresponds to a given sentence,
- retrieval: given a sentence and a video corpus, find clips that depict that sentence.
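To illustrate how a single video-sentence score can serve all three tasks, here is a minimal sketch. It is not the system's implementation: the `score` function is a toy stand-in (word overlap against hand-labeled events), and the names `generate`, `focus`, `retrieve`, and the bag-of-words "video" representation are assumptions made purely for illustration. Only the overall pattern, one scorer reused by three tasks, reflects the description above.

```python
from typing import Dict, FrozenSet, List

# Toy stand-in (assumption): a "video" is a list of candidate events,
# and each event is a bag of words. A real system scores pixels, not words.
Event = FrozenSet[str]
Video = List[Event]


def score(video: Video, sentence: str) -> float:
    """Hypothetical scorer: best fraction of sentence words found in any one event."""
    words = set(sentence.lower().split())
    if not words:
        return 0.0
    return max((len(words & event) / len(words) for event in video), default=0.0)


def generate(video: Video, candidates: List[str]) -> str:
    """Generation: pick the candidate sentence that scores highest on the video."""
    return max(candidates, key=lambda s: score(video, s))


def focus(video: Video, sentence: str) -> int:
    """Focus of attention: index of the event that best matches the sentence."""
    return max(range(len(video)), key=lambda i: score([video[i]], sentence))


def retrieve(sentence: str, corpus: Dict[str, Video], k: int) -> List[str]:
    """Retrieval: names of the k clips that score highest for the sentence."""
    ranked = sorted(corpus, key=lambda name: score(corpus[name], sentence), reverse=True)
    return ranked[:k]


# Hypothetical example data.
video_a: Video = [
    frozenset({"the", "person", "picked", "up", "ball"}),
    frozenset({"the", "dog", "ran"}),
]
video_b: Video = [frozenset({"the", "car", "stopped"})]
corpus = {"a": video_a, "b": video_b}

best_sentence = generate(video_a, ["the car stopped", "the dog ran"])
focused_event = focus(video_a, "the dog ran")
top_clips = retrieve("the car stopped", corpus, k=1)
```

In this sketch each task is only a different search over the same scoring function: generation searches over sentences, focus of attention searches over events within one video, and retrieval searches over clips.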
This work progresses in several stages:
- simultaneous object detection and tracking,
- simultaneous tracking and event recognition,
- simultaneous tracking and sentence recognition.
Work described here has appeared in:
- Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang, 'Video in sentences out', Conference on Uncertainty In Artificial Intelligence, August 2012.
- Andrei Barbu, Siddharth Narayanaswamy, Aaron Michaux, Jeffrey Mark Siskind, 'Simultaneous object detection, tracking, and event recognition', Advances in Cognitive Systems, December 2012.
- Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Siskind, Song Wang, 'Recognizing human activities from partially observed videos', IEEE Conference on Computer Vision and Pattern Recognition, June 2013.
- Daniel P. Barrett, Andrei Barbu, N. Siddharth, Jeffrey Mark Siskind, 'Saying what you're looking for: Linguistics meets video search', IEEE Transactions on Pattern Analysis and Machine Intelligence, October 2016.
- N. Siddharth, Andrei Barbu, Jeffrey Siskind, 'Seeing What You're Told: Sentence-Guided Activity Recognition In Video', IEEE Conference on Computer Vision and Pattern Recognition, June 2014.