This work combines different sources of knowledge in order to understand, manipulate, and describe the world. This is what humans do; the difference between us and machines is that we are somehow able to grasp the big picture when we look at an image, rather than getting swamped in the details.

To demonstrate our ability to combine knowledge from multiple sources and modalities, we develop an approach to recognizing, discussing, and manipulating 3D part-based objects. We combine knowledge from

  • multiple shape detectors,
  • possible parts,
  • physics,
  • multiple images from different viewpoints,
  • multiple images from different stages of assembly, and
  • natural language sentences.

In the process we reason about physical concepts such as occlusion and support, and about high-level human concepts such as windows, doors, and walls. We use a single unified approach to this problem in multiple ways: to recognize a structure, to describe it in natural language, to understand natural-language descriptions of it, to estimate our confidence in the inferred structure, and to plan how we might increase that confidence (by moving the robot camera, partially disassembling the structure, or asking a question).
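As a rough illustration of what fusing these knowledge sources might look like, the sketch below scores a candidate structure by summing log-likelihood terms from each source: per-part detector responses across viewpoints, a physics term penalizing unsupported parts, and a language term matching the hypothesis against a parsed sentence. The factorization, all names, and all weights here are our own assumptions for exposition, not the model actually used in this work.

```python
# A minimal sketch of log-linear evidence fusion; everything here is
# illustrative (hypothetical names and factorization), not the published model.
import math
from dataclasses import dataclass

@dataclass
class Part:
    kind: str          # high-level concept, e.g. "window", "door", "wall"
    supported: bool    # does physics reasoning find a supporting part below?

def detector_term(part: Part, detections: dict) -> float:
    """Sum log detector scores for this part kind over all viewpoints."""
    scores = detections.get(part.kind, [])
    return sum(math.log(max(s, 1e-9)) for s in scores)

def physics_term(part: Part, penalty: float = 5.0) -> float:
    """Penalize physically implausible (unsupported) parts."""
    return 0.0 if part.supported else -penalty

def language_term(structure: list, sentence_kinds: set) -> float:
    """Reward agreement between part kinds mentioned in a sentence
    and part kinds present in the hypothesized structure."""
    hypothesis_kinds = {p.kind for p in structure}
    return -float(len(sentence_kinds ^ hypothesis_kinds))

def score(structure, detections, sentence_kinds):
    """Joint (unnormalized) log score of one candidate structure."""
    return (sum(detector_term(p, detections) for p in structure)
            + sum(physics_term(p) for p in structure)
            + language_term(structure, sentence_kinds))

# Example: two candidate structures scored against the same evidence.
detections = {"window": [0.9, 0.8], "wall": [0.95, 0.9], "door": [0.2]}
sentence = {"window", "wall"}  # kinds parsed from "a wall with a window"
a = [Part("wall", True), Part("window", True)]
b = [Part("wall", True), Part("door", False)]
print(score(a, detections, sentence) > score(b, detections, sentence))  # True
```

Under this kind of factorization, the same score can serve the multiple uses named above: maximizing it over structures gives recognition, its sensitivity to alternative hypotheses gives a confidence estimate, and the evidence term expected to change the score most suggests which viewpoint to move to or which question to ask.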

This work has been published in: