We present a novel approach to neural response prediction that incorporates
higher-order operations directly within convolutional neural networks (CNNs).
Our model extends traditional 3D CNNs by embedding these operations within
the convolutional operator itself, enabling it to capture multiplicative
interactions between neighboring pixels across space and time.
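The abstract does not spell out the operator's exact form. As a minimal sketch, assuming a low-rank second-order (Volterra-style) expansion, a quadratic term can be added to a standard 3D convolution as the elementwise product of two linear filter responses; the module name HigherOrderConv3d and the rank parameter below are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class HigherOrderConv3d(nn.Module):
    """Sketch: a linear 3D convolution plus a low-rank quadratic term that
    captures multiplicative interactions between pixels falling inside the
    same space-time window (an assumed factorization, not the paper's code)."""

    def __init__(self, in_ch, out_ch, kernel_size, rank=4):
        super().__init__()
        self.linear = nn.Conv3d(in_ch, out_ch, kernel_size, padding="same")
        # Each rank component contributes conv_a(x) * conv_b(x): an
        # elementwise product of two linear filter outputs, a standard
        # low-rank approximation of a full quadratic (Volterra) kernel.
        self.left = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size, padding="same")
            for _ in range(rank)
        )
        self.right = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size, padding="same")
            for _ in range(rank)
        )

    def forward(self, x):  # x: (batch, channels, time, height, width)
        out = self.linear(x)
        for a, b in zip(self.left, self.right):
            out = out + a(x) * b(x)  # quadratic cross-terms within the window
        return out

# Example: 8 filters over a 5-frame x 9x9-pixel space-time window.
layer = HigherOrderConv3d(in_ch=1, out_ch=8, kernel_size=(5, 9, 9))
resp = layer(torch.randn(2, 1, 20, 36, 36))  # -> (2, 8, 20, 36, 36)
```

Because the product term stays local to a single space-time window, one such layer can express pixel-pixel interactions that a purely linear convolution would need additional depth and nonlinearities to approximate.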
Our model increases the representational power of CNNs without increasing their
depth, thereby addressing the architectural disparity between deep artificial
networks and the relatively shallow processing hierarchy of biological visual
systems. We evaluate our approach on two distinct datasets: salamander retinal
ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC
responses to controlled geometric transformations. Our higher-order CNN (HoCNN)
outperforms standard architectures while requiring only half the training
data, achieving correlation coefficients with neural responses of up to 0.75
(against a retinal reliability of 0.80 ± 0.02). When
integrated into state-of-the-art architectures, our approach consistently
improves performance across different species and stimulus conditions. Analysis
of the learned representations reveals that our network naturally encodes
fundamental geometric transformations, particularly scaling parameters that
characterize object expansion and contraction. This capability is especially
relevant for specific cell types, such as transient OFF-alpha and transient ON
cells, which are known to detect looming objects and object motion,
respectively, and for which our model shows marked improvement in response
prediction. The correlation coefficients for scaling parameters are more than
twice as high for HoCNN (0.72) as for baseline models (0.32).
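For concreteness, one common way such correlation numbers are estimated is a per-neuron Pearson correlation between predicted and recorded responses, with retinal reliability taken as the mean correlation between each trial and the average of the held-out trials. The sketch below assumes this estimator; the abstract does not specify which convention the paper uses:

```python
import numpy as np

def pearson_per_neuron(pred, true):
    """Pearson correlation between predicted and recorded responses,
    computed independently for each neuron.
    pred, true: arrays of shape (time_bins, n_neurons)."""
    pred = pred - pred.mean(axis=0)
    true = true - true.mean(axis=0)
    num = (pred * true).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (true ** 2).sum(axis=0))
    return num / den

def retinal_reliability(trials):
    """Trial-to-trial ceiling (one common convention): correlate each trial
    with the mean of the remaining trials, then average over trials.
    trials: array of shape (n_trials, time_bins, n_neurons)."""
    scores = [
        pearson_per_neuron(trials[i], np.delete(trials, i, axis=0).mean(axis=0))
        for i in range(trials.shape[0])
    ]
    return np.mean(scores, axis=0)
```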