In review
Saying what you're looking for: linguistics meets video search
Andrei Barbu, N. Siddharth, Jeffrey Mark Siskind
journal
In review at: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)
We present an approach to searching large video corpora for video clips that depict a natural-language query in the form of a sentence. This approach uses compositional semantics to encode subtle meaning that is lost in other systems, such as the difference between two sentences that contain the same words but have entirely different meanings: 'The person rode the horse' vs. 'The horse rode the person'. We demonstrate this approach by searching for 141 queries involving people and horses interacting with each other in 10 full-length Hollywood movies.
In review
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
Andrei Barbu, N. Siddharth, Jeffrey Mark Siskind
journal
In review at: Journal of Artificial Intelligence Research (JAIR)
We present an approach to searching large video corpora for video clips that depict a natural-language query in the form of a sentence. This approach uses compositional semantics to encode subtle meaning that is lost in other systems, such as the difference between two sentences that contain the same words but have entirely different meanings: 'The person rode the horse' vs. 'The horse rode the person'. We demonstrate this approach by searching for 141 queries involving people and horses interacting with each other in 10 full-length Hollywood movies.
In review
The compositional nature of verb and argument representations in the human brain
Andrei Barbu, N. Siddharth, Caiming Xiong, Jason J. Corso, Christiane D. Fellbaum, Catherine Hanson, Stephen Jose Hanson, Sebastien Helie, Evguenia Malaia, Barak A. Pearlmutter, Jeffrey Mark Siskind, Thomas Michael Talavage, Ronnie B. Wilbur
journal
In review at: Science
We seek to understand the compositional nature of representations in the human brain. Subjects were shown videos of humans interacting with objects, and descriptions of those videos were recovered by identifying thoughts from fMRI activation patterns. Leading up to this result, we demonstrate the novel ability to decode thoughts corresponding to one of six verbs. Robustness of these results is demonstrated by replication at two different sites, and by preliminary work indicating the commonality of neural representations for verbs across subjects. We next demonstrate the ability to decode thoughts of multiple entities---the class of an object and the identity of an actor---simultaneously. These novel abilities allow us to show that the neural representation for argument structure may be compositional by decoding a complex thought composed of an actor, a verb, a direction, and an object.
In review
Video Retrieval with Sentential Queries
Andrei Barbu, N. Siddharth, Jeffrey Mark Siskind
conference
In review at: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present an approach to searching large video corpora for video clips that depict a natural-language query in the form of a sentence. This approach uses compositional semantics to encode subtle meaning that is lost in other systems, such as the difference between two sentences that contain the same words but have entirely different meanings: 'The person rode the horse' vs. 'The horse rode the person'. We demonstrate this approach by searching for 141 queries involving people and horses interacting with each other in 10 full-length Hollywood movies.
In review
Seeing What You're Told: Sentence-Guided Activity Recognition In Video
N. Siddharth, Andrei Barbu, Jeffrey Siskind
conference
In review at: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, provides a mechanism for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and the changing spatial relations between participants (prepositions), expressed as whole sentential descriptions mediated by a grammar, guide the activity-recognition process.
June 2013
Recognizing human activities from partially observed videos
Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, Haonan Yu, Aaron Michaux, Yuewei Lin, Sven Dickinson, Jeffrey Siskind, Song Wang
conference, poster (472/1870, 25%)
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
An approach to recognizing human activities in partially observed videos using sparse coding.
December 2012
Simultaneous object detection, tracking, and event recognition
Andrei Barbu, Siddharth Narayanaswamy, Aaron Michaux, Jeffrey Mark Siskind
journal, oral at associated conference (14/38, 37%)
Instead of recognizing hammering by first detecting the motion of a hammer, we simultaneously search for both the hammering and the hammer: knowledge about the event helps recognize the object, and knowledge about the object helps recognize the event. By combining all three tasks (object detection, tracking, and event recognition) into a single cost function, they are seamlessly integrated, and top-down information can influence bottom-up visual processing and vice versa.
December 2012
Seeing unseeability to see the unseeable
Siddharth Narayanaswamy, Andrei Barbu, Jeffrey Mark Siskind
journal, oral at associated conference (14/38, 37%)
There are many things you don't need to look at in order to understand or describe them. For example, you might see a rooftop and, by virtue of the fact that it isn't falling to the ground, know that a wall supports it. Someone might also tell you, "Hey, there's a window over there by the door". This raises the four questions this paper addresses: "how do you infer knowledge from the physical constraints of the world?", "how confident are you that what you're seeing is real?", "what can you do to raise that confidence?", and "how do you describe what you're seeing?".
August 2012
Video in sentences out
Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang
conference, oral (24/304, 8%)
Conference on Uncertainty In Artificial Intelligence (UAI)
Recognizing actions in videos and putting them in context by recognizing and tracking the participants, then generating rich sentences that describe the action, its manner, the participants, and their changing spatial relationships. For the sample video we generate "The person slowly arrived from the right" and "The person slowly went leftward."
April 2012
Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction
Andrei Barbu, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin, Zhiqi Zhang
technical report
CoRR (arXiv:1204.3616)
We present an approach to labeling short video clips with English verbs as event descriptions. A key distinguishing aspect of this work is that it labels videos with verbs that describe the spatiotemporal interaction between event participants, humans and objects interacting with each other, abstracting away all object-class information and fine-grained image characteristics, and relying solely on the coarse-grained motion of the event participants. We apply our approach to a large set of 22 distinct verb classes and a corpus of 2,584 videos, yielding two surprising outcomes. First, a classification accuracy of greater than 70% on a 1-out-of-22 labeling task and greater than 85% on a variety of 1-out-of-10 subsets of this labeling task is independent of the choice of which of two different time-series classifiers we employ. Second, we achieve this level of accuracy using a highly impoverished intermediate representation consisting solely of the bounding boxes of one or two event participants as a function of time. This indicates that successful event recognition depends more on the choice of appropriate features that characterize the linguistic invariants of the event classes than on the particular classifier algorithms.
May 2011
A visual language model for estimating object pose and structure in a generative visual domain
Siddharth Narayanaswamy, Andrei Barbu, Jeffrey Mark Siskind
conference, oral (982/2004, 49%)
IEEE International Conference on Robotics and Automation (ICRA)
Understanding and manipulating complex scenes through their components, using a visual language model that encodes physical constraints, such as "something must be supported or it will fall", and the 3D affordances of parts, which allow them to be assembled into entire objects.
May 2010
Learning physically-instantiated game play through visual observation
Andrei Barbu, Siddharth Narayanaswamy, Jeffrey Mark Siskind
conference, oral (856/2062, 42%)
IEEE International Conference on Robotics and Automation (ICRA)
An approach to learning complex rules of social interaction by observing real agents, either robotic or human, in the real world, and then using that knowledge to drive further interaction with them.