Recent data-driven approaches to scene interpretation predominantly pose
inference as an end-to-end black-box mapping, commonly performed by a
Convolutional Neural Network (CNN). However, decades of work on perceptual
organization in both human and machine vision suggests that there are often
intermediate representations that are intrinsic to an inference task, and which
provide essential structure to improve generalization. In this work, we explore
an approach for injecting prior domain structure into neural network training
by supervising hidden layers of a CNN with intermediate concepts that normally
are not observed in practice. We formulate a probabilistic framework which
formalizes these notions and predicts improved generalization via this deep
supervision method. One advantage of this approach is that we are able to train
only from synthetic CAD renderings of cluttered scenes, where concept values
can be extracted, but apply the results to real images. Our implementation
achieves the state-of-the-art performance of 2D/3D keypoint localization and
image classification on real image benchmarks, including KITTI, PASCAL VOC,
PASCAL3D+, IKEA, and CIFAR100. We provide additional evidence that our approach
outperforms alternative forms of supervision, such as multi-task networks.