Grounding Knowledge

Joint work with Jeffrey Mark Siskind and Siddharth Narayanaswamy

A visual language model for estimating object pose and structure in a generative visual domain, ICRA 2011

Disassembly of a Lincoln Log structure

After the arm is calibrated, the pose and structure of the Lincoln Log assembly are determined solely from visual input. The structure is then disassembled.

Disassembly of a Lincoln Log structure from multiple views and linguistic input

After the arm is calibrated, the pose and structure of the Lincoln Log assembly are estimated. A single view is insufficient to disambiguate the structure, but combining it with a second view resolves the ambiguity. The second view is then discarded and a linguistic constraint is applied instead, which also disambiguates the structure. The structure thus determined is then disassembled.

Lincoln Log structure estimation from a single image

Once we have determined the pose of a Lincoln Log assembly (left) we can correctly determine the types and positions of the logs (shown in green) that constitute the assembly (right).

Structure estimation from spatially distinct views

Due to occlusion, a single view (left) may provide insufficient information to support correct structure estimation (the false negative shown in orange).
Integrating information from a second view (right) of the same structure, prior to disassembly, can correct the error.
Correctly determined absence of logs is shown in blue.
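As a rough illustration of how evidence from two views might be combined, the following is a minimal sketch and not the model from the paper. It assumes each view assigns every candidate log position it can see a log-likelihood ratio for "a log is present", that views are independent, and that an occluded position contributes no evidence. The slot names and scores are hypothetical.

```python
from typing import Dict, List

def fuse_views(views: List[Dict[str, float]], prior: float = 0.0) -> Dict[str, bool]:
    """Decide presence per slot by summing log-likelihood ratios across views.

    Assumption: a slot missing from a view is occluded there and adds 0.0.
    """
    slots = set().union(*views)
    fused = {}
    for slot in sorted(slots):
        total = prior + sum(view.get(slot, 0.0) for view in views)
        fused[slot] = total > 0.0  # positive total evidence means "log present"
    return fused

# Hypothetical scores: the left view barely sees the back log.
left_view = {"front": 2.0, "back": -0.5}
right_view = {"back": 3.0, "middle": -2.0}

single = fuse_views([left_view])
both = fuse_views([left_view, right_view])
```

With these made-up numbers the left view alone misses the back log, while the fused estimate recovers it and also confirms that the middle position is empty.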

Structure estimation from temporally distinct views

Another way to recover occluded information is to begin the task of disassembly with partial information (left) and then reimage the structure from the same camera pose part-way through disassembly after the occlusion has been eliminated.
The information from two temporally distinct views of distinct assembly states can be integrated to yield a correct model of the initial structure (right).
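The integration of temporally distinct views can be sketched in a similarly simplified way. This toy example assumes each slot is observed as present (`True`), absent (`False`), or occluded (`None`), and that any slot the arm has not touched keeps its state between the two images. The function and slot names are hypothetical, not taken from the system.

```python
from typing import Dict, Optional, Set

def merge_temporal(
    initial: Dict[str, Optional[bool]],
    removed: Set[str],
    later: Dict[str, Optional[bool]],
) -> Dict[str, Optional[bool]]:
    """Reconstruct the initial structure from an early and a later image."""
    merged = {}
    for slot, seen in initial.items():
        if slot in removed:
            merged[slot] = True          # a log the arm removed was there initially
        elif seen is not None:
            merged[slot] = seen          # already observed in the first image
        else:
            merged[slot] = later.get(slot)  # revealed once the occluder is gone
    return merged

initial = {"top": True, "hidden_a": None, "hidden_b": None, "base": True}
later = {"hidden_a": True, "hidden_b": False, "base": True}
model = merge_temporal(initial, removed={"top"}, later=later)
```

Slots that remain occluded in both images stay `None`, so the caller can tell that the model is still incomplete.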

Structure estimation given constraints

Occluded information can be recovered from a single image by constraining the space of possible structures (in this case, by specifying the piece inventory).
Our goal is for multiple agents to communicate such constraints linguistically and infer such constraints through high-level reasoning.
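To show how a piece inventory can prune the space of structures, here is a small sketch under simplifying assumptions: visible slots have a known log type (or `None` for empty), each occluded slot has a short list of options, and the inventory lists exactly the pieces used. The log type names `"long"` and `"short"` are placeholders, and brute-force enumeration stands in for whatever inference the real system performs.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Optional, Sequence

Structure = Dict[str, Optional[str]]

def consistent_structures(
    observed: Structure,
    candidates: Dict[str, Sequence[Optional[str]]],
    inventory: Dict[str, int],
) -> List[Structure]:
    """Enumerate fillings of occluded slots whose piece counts match the inventory."""
    hidden = sorted(candidates)
    target = +Counter(inventory)  # unary plus drops zero counts
    results = []
    for choice in product(*(candidates[s] for s in hidden)):
        structure = dict(observed)
        structure.update(zip(hidden, choice))
        used = Counter(t for t in structure.values() if t is not None)
        if used == target:
            results.append(structure)
    return results

observed = {"s1": "long", "s2": "long", "s3": "short"}
candidates = {"s4": ("short", None)}  # occluded slot: a short log or nothing
matches = consistent_structures(observed, candidates, {"long": 2, "short": 2})
```

Here the image alone leaves slot `s4` ambiguous, but only one filling uses exactly two long and two short logs.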

From language to motor control to vision to language

One of the reasons for grounding language is to allow the transfer of information between modalities. Language can drive motor control: we give the system a sentence ("Seven doors exist") and it finds a structure that satisfies the sentence. Once built, that structure is recognized by the vision system, which produces the same sentence. "Seven doors exist" fully describes this structure given its size.
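The round trip from sentence to structure and back can be illustrated with a deliberately tiny toy model. It assumes a structure is just a fixed number of wall positions, each either a door or not, and that the only sentence form is "<Number> doors exist". None of this reflects the actual grammar or structure representation used in the system.

```python
from itertools import product
from typing import List, Tuple

NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}

def parse(sentence: str) -> int:
    """Return the door count asserted by a '<Number> doors exist' sentence."""
    word, noun, verb = sentence.lower().split()
    if (noun, verb) != ("doors", "exist"):
        raise ValueError("unsupported sentence: " + sentence)
    return NUMBER_WORDS[word]

def structures_satisfying(sentence: str, size: int) -> List[Tuple[bool, ...]]:
    """All door layouts of the given size that make the sentence true."""
    count = parse(sentence)
    return [s for s in product((False, True), repeat=size) if sum(s) == count]

def describe(structure: Tuple[bool, ...]) -> str:
    """Produce the sentence that a (toy) vision system would report."""
    count = sum(structure)
    word = next(w for w, n in NUMBER_WORDS.items() if n == count)
    return word.capitalize() + " doors exist"

layouts = structures_satisfying("Seven doors exist", size=7)
round_trip = describe(layouts[0])
```

In this toy model exactly one layout of size seven has seven doors, which mirrors the sense in which the sentence fully describes the structure given its size; with eight positions the same sentence would admit several layouts.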
