
Rapid Feature Evolution Accelerates Learning in Neural Networks
Neural network (NN) training and generalization in the infinite-width limit are well characterized by kernel methods with a neural tangent kernel (NTK) that is stationary in time. However, finite-width NNs consistently outperform the corresponding kernel methods, suggesting the importance of feature learning, which manifests as the time evolution of the NTK. Here, we analyze the phenomenon of kernel alignment of the NTK with the target function during gradient descent. We first provide a mechanistic explanation for why alignment between task and kernel occurs in deep linear networks. We then show that this behavior occurs more generally if one optimizes the feature map over time to accelerate learning while constraining how quickly the features evolve. Empirically, gradient descent undergoes a feature-learning phase, during which the top eigenfunctions of the NTK quickly align with the target function and the loss decreases faster than a power law in time; it then enters a kernel gradient descent (KGD) phase, in which the alignment does not improve significantly and the training loss decreases as a power law. We show that feature evolution is faster and more dramatic in deeper networks. We also find that networks with multiple output nodes develop separate, specialized kernels for each output channel, a phenomenon we term kernel specialization. We show that this class-specific alignment does not occur in linear networks.
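The kernel–target alignment tracked in the abstract can be illustrated with a minimal sketch. Assuming a toy one-hidden-layer ReLU network (all sizes and names here are hypothetical, not from the paper), the snippet below builds the empirical NTK Gram matrix K from the per-example parameter Jacobian and computes the standard alignment measure A(K, y) = yᵀKy / (‖K‖_F ‖y‖²), which lies in [0, 1] since K is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network f(x) = v^T relu(W x); sizes are illustrative.
n, d, h = 20, 3, 50
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                      # target function on the inputs

W = rng.normal(size=(h, d)) / np.sqrt(d)
v = rng.normal(size=h) / np.sqrt(h)

def jacobian(X, W, v):
    """Gradient of f(x) w.r.t. all parameters, one flattened row per example."""
    rows = []
    for x in X:
        pre = W @ x                      # pre-activations, shape (h,)
        act = np.maximum(pre, 0.0)       # ReLU activations
        dv = act                         # df/dv
        dW = np.outer(v * (pre > 0), x)  # df/dW via the ReLU mask
        rows.append(np.concatenate([dW.ravel(), dv]))
    return np.array(rows)                # shape (n, num_params)

J = jacobian(X, W, v)
K = J @ J.T                              # empirical NTK Gram matrix, PSD

# Kernel–target alignment: A(K, y) = y^T K y / (||K||_F * ||y||^2).
A = (y @ K @ y) / (np.linalg.norm(K, "fro") * (y @ y))
print(f"alignment = {A:.3f}")            # grows if top NTK eigenfunctions align with y
```

Recomputing A at successive gradient-descent steps would trace the feature-learning phase described above: A rises quickly while the NTK evolves, then plateaus in the KGD phase.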