Decision trees are important both as interpretable models amenable to
high-stakes decision-making, and as building blocks of ensemble methods such as
random forests and gradient boosting. Their statistical properties, however,
are not well understood. The most cited prior works have focused on deriving
pointwise consistency guarantees for CART in a classical nonparametric
regression setting. We take a different approach, and advocate studying the
generalization performance of decision trees with respect to different
generative regression models. This allows us to elicit their inductive bias,
that is, the assumptions the algorithms make (or do not make) to generalize to
new data, thereby guiding practitioners on when and how to apply these methods.
In this paper, we focus on sparse additive generative models, which have both
low statistical complexity and some nonparametric flexibility. We prove a sharp
squared error generalization lower bound for a large class of decision tree
algorithms fitted to sparse additive models with
$C^1$ component functions.
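Concretely, a sparse additive generative model of the kind studied here can be written as follows; the notation ($d$ ambient covariates, an active set $S$ of size $s$, component functions $f_j$, and noise $\varepsilon$) is illustrative rather than fixed by this abstract:
\[
Y \;=\; \sum_{j \in S} f_j(X_j) \;+\; \varepsilon,
\qquad S \subseteq \{1, \dots, d\}, \quad |S| = s \ll d,
\]
where each component $f_j$ is continuously differentiable ($C^1$) and $\varepsilon$ is mean-zero noise independent of $X$.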
Surprisingly, this lower bound is much worse than the minimax rate for estimating such
sparse additive models. The inefficiency is due not to greediness, but to the
loss in power for detecting global structure when we average responses solely
over each leaf, an observation that suggests opportunities to improve
tree-based algorithms, for example, by hierarchical shrinkage. To prove these
bounds, we develop new technical machinery, establishing a novel connection
between decision tree estimation and rate-distortion theory, a sub-field of
information theory.
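As a hedged illustration of the hierarchical shrinkage idea mentioned above (a sketch, not the authors' implementation), the following Python snippet applies shrinkage post hoc to a fitted scikit-learn regression tree: each prediction telescopes the node means along the root-to-leaf path, and each increment is damped by a factor that depends on the parent node's sample count. The function name \verb|hs_predict| and the hyperparameter \verb|lam| are our own illustrative choices.
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def hs_predict(tree, X, lam):
    """Hierarchically shrunk predictions from a fitted DecisionTreeRegressor.

    Instead of returning the raw leaf mean, the prediction telescopes node
    means along the root-to-leaf path, damping each increment by
    1 / (1 + lam / n_samples(parent)); lam controls the shrinkage strength.
    """
    t = tree.tree_
    X = np.asarray(X, dtype=float)
    preds = np.empty(X.shape[0])
    for i, x in enumerate(X):
        node = 0
        pred = t.value[node, 0, 0]            # mean response at the root
        while t.children_left[node] != -1:    # descend until reaching a leaf
            n_parent = t.n_node_samples[node]
            go_left = x[t.feature[node]] <= t.threshold[node]
            child = t.children_left[node] if go_left else t.children_right[node]
            # shrink the change in node mean contributed by this split
            pred += (t.value[child, 0, 0] - t.value[node, 0, 0]) / (1.0 + lam / n_parent)
            node = child
        preds[i] = pred
    return preds

# Example: fit a tree to data from a sparse additive model (2 of 10 active features).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 10))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y)
y_hat = hs_predict(tree, X, lam=10.0)
\end{verbatim}
With \verb|lam = 0| this reduces to the usual leaf-average prediction; larger values pull deep, small-sample splits toward the global mean, which is one way of recovering some of the global structure that pure leaf averaging loses.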