Deep neural networks possess the theoretical capacity to represent complex logical functions, yet gradient descent frequently fails to discover these solutions, highlighting a severe expressivity-learnability gap. This failure often occurs because networks remain trapped in a lazy, kernel-like training regime where internal representations fail to adapt to the target distribution. By understanding how the structural complexity of weight matrices evolves during the transition to rich feature learning, we can design optimization strategies that unlock a network's full representational power.
Approach
We propose a novel training framework that actively monitors and regularizes the spectral complexity of weight matrices to force networks out of the lazy regime and into a rich feature-learning state. Building on the concept of weight expansion from [Weight Expansion: A New Perspective on Dropout and Generalization](/paper/art_0e2821aaa9ed44dcaf878db5a49d0922), we introduce a loss term that rewards a large normalized determinant of the weight covariance matrix during the initial training epochs. This explicit volume expansion prevents the network from behaving as a static kernel, a failure mode identified in [Lecture notes: From Gaussian processes to feature learning](/paper/art_bc2cb6266150455e9833715034887ebf). By tracking the effective rank of the neural tangent kernel, as proposed in [Implicit Regularization via Neural Feature Alignment](/paper/art_fc9efec14d724374b0beb67f55028a44), we dynamically adjust the regularization strength to ensure the network navigates the non-convex landscape required to learn complex functions.
Experimental Plan
We evaluate our approach on the Majority Boolean Logic benchmark and synthetic parity tasks, where standard gradient descent provably fails, as shown in [Provable Failure of Language Models in Learning Majority Boolean Logic via Gradient Descent](/paper/art_0218d60cf3d442969ad081fe3ebead79). Our primary hypothesis is that spectral volume regularization will enable standard MLPs and Transformers to achieve high accuracy on these tasks by forcing early feature alignment, whereas unregularized models will remain at chance-level accuracy. We compare our method against standard weight decay, dropout, and the $\mu$P initialization scheme from [Non-Gaussian Tensor Programs](/paper/art_8dbfeb034b9b495eae9f4f16ca5f8e5a). Metrics include final test accuracy, generalization gap, and the layer-wise effective rank measured at epoch 10 to validate the early transition into the rich learning phase.
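As a concrete reference for the benchmark tasks, the following sketch generates parity and majority instances over $\pm 1$ inputs, along with the chance baseline an unregularized model is expected to match. The function names and the choice of the first $k$ coordinates as the parity support are illustrative assumptions, not fixed by the benchmark.

```python
import numpy as np

def make_parity_data(n_samples, n_bits, k, seed=0):
    """k-sparse parity over {-1,+1}^n_bits: the label is the product of
    the first k coordinates (+1 iff an even number of them are -1)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = np.prod(X[:, :k], axis=1)
    return X, y

def make_majority_data(n_samples, n_bits, seed=0):
    """Majority over {-1,+1}^n_bits: the label is the sign of the bit
    sum (use odd n_bits so the label is never zero)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    y = np.sign(X.sum(axis=1))
    return X, y

def chance_accuracy(y):
    """Accuracy of always predicting the more frequent label: the
    baseline that a model stuck in the lazy regime should not exceed."""
    return max(np.mean(y == 1.0), np.mean(y == -1.0))
```

For parity the two labels are balanced, so `chance_accuracy` is close to 0.5 and any statistically significant gain over it indicates that feature learning has occurred.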