https://www.arxiv.org/abs/1812.05994
For products of random matrices in which the number of factors and the size of the matrices tend to infinity, show that the logarithm of the L2 norm of such a product applied to any fixed vector is asymptotically Gaussian.
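A minimal numerical sketch of this claim (my own toy setup, not the paper's): multiply a fixed unit vector by many i.i.d. Gaussian matrices with 1/n variance scaling and look at the distribution of the log norm across repetitions.

```python
# Toy check (assumed Gaussian ensemble, not the paper's general setting):
# the log L2 norm of a long product of random matrices applied to a fixed
# vector is approximately Gaussian for large width and depth.
import numpy as np

rng = np.random.default_rng(0)
n, depth, trials = 100, 50, 2000        # width, number of factors, repetitions

v = np.ones(n) / np.sqrt(n)             # fixed unit vector
log_norms = []
for _ in range(trials):
    x = v.copy()
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))  # variance 1/n scaling
        x = W @ x
    log_norms.append(np.log(np.linalg.norm(x)))

log_norms = np.array(log_norms)
print("mean of log ||W_L ... W_1 v||:", log_norms.mean())
print("std  of log ||W_L ... W_1 v||:", log_norms.std())
# A histogram of `log_norms` is close to a Gaussian bell curve.
```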
https://proceedings.neurips.cc/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf
Show that pointwise nonlinearities can be incorporated into a standard method of proof in random matrix theory known as the moments method.
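For reference, a generic statement of the moment method this summary refers to (not the paper's specific nonlinear ensemble): spectral moments are normalized traces, and convergence of all expected moments identifies the limiting spectral distribution.

```latex
% Moment method, generic form: for a symmetric matrix M \in \mathbb{R}^{n \times n}
% with empirical spectral distribution \mu_n, and assuming the limiting moments
% determine \mu uniquely (e.g. via Carleman's condition):
\[
  \int x^k \, d\mu_n(x) = \frac{1}{n} \operatorname{tr} M^k,
  \qquad
  \lim_{n \to \infty} \mathbb{E}\!\left[\tfrac{1}{n} \operatorname{tr} M^k\right] = m_k
  \;\;\text{for all } k
  \;\Longrightarrow\;
  \mu_n \to \mu \ \text{with moments } m_k .
\]
```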
https://www.arxiv.org/abs/1912.00827
Analyze the performance of a simple regression model trained on random features generated with a random weight matrix and random bias vector, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. Find that a mixture of nonlinearities can outperform the best single nonlinearity on this task.
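A minimal sketch of the setup under assumed parameters (the mixture below of a linear and a ReLU nonlinearity is illustrative, not the paper's exact choice): a ridge readout trained on random features f(Wx + b) to reconstruct clean inputs from noisy ones.

```python
# Minimal sketch (assumed sizes and nonlinearities): random features regression
# on a noisy autoencoding task, with a tunable mixture of two nonlinearities.
import numpy as np

rng = np.random.default_rng(0)
d, m, n, noise, ridge = 50, 200, 1000, 0.5, 1e-2

X_clean = rng.normal(size=(d, n))
X_noisy = X_clean + noise * rng.normal(size=(d, n))

W = rng.normal(0, 1 / np.sqrt(d), (m, d))    # random weight matrix
b = rng.normal(0, 1, (m, 1))                 # random bias vector

def features(X, mix):
    Z = W @ X + b
    return mix * Z + (1 - mix) * np.maximum(Z, 0)   # mixture of linear and ReLU

def train_error(mix):
    Y = features(X_noisy, mix)                       # m x n random features
    A = X_clean @ Y.T @ np.linalg.inv(Y @ Y.T + ridge * np.eye(m))  # ridge readout
    return np.mean((A @ Y - X_clean) ** 2)

for mix in [0.0, 0.5, 1.0]:
    print(f"mixture weight {mix:.1f}: training error {train_error(mix):.4f}")
```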
Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training
https://www.arxiv.org/abs/2006.09092
Demonstrate that the extremal eigenvalues of the batch Hessian are larger in magnitude than those of the empirical Hessian, using spiked, field-dependent random matrix theory. Derive analytical expressions for the maximal learning rate as a function of batch size.
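A toy illustration of the phenomenon (quadratic loss only, not the paper's spiked field-dependent analysis): compare the largest eigenvalue of the full empirical Hessian with that of a mini-batch Hessian, and the implied maximal stable gradient-descent step size 2 / lambda_max.

```python
# Minimal sketch (assumed quadratic loss): mini-batch Hessians have larger
# extremal eigenvalues than the full empirical Hessian, so the maximal stable
# learning rate shrinks with smaller batches.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 5000, 200, 64
X = rng.normal(size=(n, d))

full_hessian = X.T @ X / n
lam_full = np.linalg.eigvalsh(full_hessian)[-1]

batch = rng.choice(n, size=b, replace=False)
batch_hessian = X[batch].T @ X[batch] / b
lam_batch = np.linalg.eigvalsh(batch_hessian)[-1]

print("largest eigenvalue, empirical Hessian:", lam_full)
print("largest eigenvalue, batch Hessian:    ", lam_batch)   # typically larger
print("max stable step sizes 2/lambda_max:   ", 2 / lam_full, 2 / lam_batch)
```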
https://www.arxiv.org/abs/2111.00841
The spectral density of the Jacobian is crucial for analyzing robustness, and such Jacobians are modeled using free multiplicative convolutions from free probability theory. Present a reliable and very fast method for computing the associated spectral densities, based on an adaptive Newton-Raphson scheme.
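To show the flavor of such a scheme, here is a minimal sketch that is not the paper's algorithm: Newton-Raphson on the self-consistent equation of the semicircle law's Stieltjes transform (a much simpler target than a free multiplicative convolution), warm-started along the evaluation grid.

```python
# Minimal sketch (toy semicircle target, not the paper's free multiplicative
# convolution): recover a spectral density from its Stieltjes transform m(z),
# which here satisfies m^2 + z m + 1 = 0, via Newton-Raphson with warm starts.
import numpy as np

def newton_stieltjes(x_grid, eta=1e-3, tol=1e-12, max_iter=200):
    densities = []
    m = 1j                                   # initial guess in the upper half-plane
    for x in x_grid:                         # warm-start each point from the last
        z = x + 1j * eta
        for _ in range(max_iter):
            step = (m * m + z * m + 1.0) / (2.0 * m + z)
            m = m - step
            if abs(step) < tol:
                break
        if m.imag < 0:                       # keep the physical (upper half-plane) root
            m = -z - m                       # the two roots sum to -z
        densities.append(m.imag / np.pi)     # density = (1/pi) Im m(x + i eta)
    return np.array(densities)

x = np.linspace(-2.5, 2.5, 201)
rho = newton_stieltjes(x)
exact = np.sqrt(np.clip(4.0 - x**2, 0.0, None)) / (2.0 * np.pi)
print("max deviation from the exact semicircle density:", np.max(np.abs(rho - exact)))
```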
https://www.arxiv.org/abs/2102.06740
Propose a novel model for the true loss surfaces of neural networks which allows for Hessian spectral densities with rank degeneracy and outliers, and predicts a growing independence of loss gradients as a function of distance in weight-space.
https://www.arxiv.org/abs/1802.09979
Using tools from free probability theory, provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters, including the nonlinearity, the weight and bias distributions, and the depth.
Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method
https://www.arxiv.org/abs/2105.05115
Investigate the spectral density of the random feature matrix M = YY^* where Y = f(WX+B), extending previous results that did not include a bias term, and show that it is impossible to choose an activation function that preserves the layer-to-layer singular value distribution. Use the resolvent method with the cumulant expansion, which is more robust and less combinatorial than the moment method.
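A minimal sketch of the object being studied, under assumed Gaussian data, weights, and biases: build Y = f(WX + B) and look at the empirical spectrum of M = YY^*/n.

```python
# Minimal sketch (assumed Gaussian X, W, B and tanh activation): empirical
# spectral density of M = Y Y^* / n with Y = f(W X + B), for comparison with
# the bias-free case B = 0.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 400, 400, 800

X = rng.normal(size=(d, n))
W = rng.normal(0, 1 / np.sqrt(d), (m, d))
B = rng.normal(0, 1, (m, 1))              # bias, broadcast across samples

Y = np.tanh(W @ X + B)                    # m x n random feature matrix
M = Y @ Y.T / n                           # m x m Gram-type matrix
eigs = np.linalg.eigvalsh(M)
print("support of the empirical spectrum: [%.3f, %.3f]" % (eigs.min(), eigs.max()))
# A histogram of `eigs` approximates the limiting spectral density studied above.
```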
https://www.arxiv.org/abs/2109.01785
Using random matrix theory on GCNs, show that if the graph is sufficiently random, the GCN fails to benefit from the node features; then propose a node feature kernel that addresses this problem.
https://www.arxiv.org/abs/2111.15256
Study the Fisher information matrix (FIM) of a one-hidden-layer network and show that its eigenvalue distribution has three major clusters: the largest eigenvalue is a Perron-Frobenius-type eigenvalue, the eigenspace of the next cluster of large eigenvalues is spanned by the row vectors of the first weight matrix, and the direct sum of the first eigenspace and the third cluster's eigenspace is spanned by Hadamard products of rows of the first weight matrix.
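A small numerical sketch of the object in question (assumed tanh network and Gaussian inputs, purely illustrative): compute the empirical FIM of a one-hidden-layer network and inspect its largest eigenvalues, which separate into groups of very different magnitude.

```python
# Minimal sketch (assumed setup): empirical Fisher information matrix of a
# one-hidden-layer network f(x) = a^T tanh(W x); its spectrum shows a large
# isolated eigenvalue followed by clusters.
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 30, 20, 5000                     # input dim, hidden width, samples
W = rng.normal(0, 1 / np.sqrt(d), (m, d))  # first-layer weights
a = rng.normal(0, 1 / np.sqrt(m), m)       # readout weights
X = rng.normal(size=(N, d))

grads = []
for x in X:
    s = np.tanh(W @ x)
    ds = 1 - s**2
    grad_a = s                                   # d f / d a
    grad_W = (a * ds)[:, None] * x[None, :]      # d f / d W, shape (m, d)
    grads.append(np.concatenate([grad_a, grad_W.ravel()]))

G = np.array(grads)
fim = G.T @ G / N                                # empirical FIM
eigs = np.sort(np.linalg.eigvalsh(fim))[::-1]
print("largest FIM eigenvalues:", eigs[:5])
```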
A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double Descent
https://www.arxiv.org/abs/2006.05013
Derive exact asymptotics of random Fourier feature (RFF) regression, showing that the RFF Gram matrix does not converge to the well-known Gaussian kernel matrix but still has a tractable behavior that yields accurate estimates of the regression error.
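A minimal sketch of the comparison (standard cos/sin RFF construction, assumed sizes): with the number of samples and the number of random features of comparable size, the RFF Gram matrix can differ noticeably from the Gaussian kernel matrix it approximates.

```python
# Minimal sketch (assumed regime n ~ N ~ p): random Fourier feature Gram matrix
# versus the Gaussian kernel matrix, measured entrywise and in operator norm.
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 300, 100, 300                     # samples, dimension, random features

X = rng.normal(size=(n, p)) / np.sqrt(p)
W = rng.normal(size=(N, p))                 # frequencies for the unit-bandwidth Gaussian kernel

Z = np.concatenate([np.cos(X @ W.T), np.sin(X @ W.T)], axis=1) / np.sqrt(N)
K_rff = Z @ Z.T                             # n x n RFF Gram matrix

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / 2)             # exact Gaussian kernel matrix

print("entrywise max difference:", np.max(np.abs(K_rff - K_gauss)))
print("operator-norm difference:", np.linalg.norm(K_rff - K_gauss, 2))
```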
https://www.arxiv.org/abs/2201.04543
Deals with an untrained network's input-output Jacobian in the infinite-width limit, using techniques different from previous work, and justifies that the singular value distribution of the Jacobian coincides with that of an analog of the Jacobian built from special random but weight-independent diagonal matrices.
Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case
https://www.arxiv.org/abs/2103.13466
A DNN's Jacobians, both with respect to parameters and to inputs, are polynomials in the layerwise Jacobians, so asymptotic freeness of the latter is crucial for propagating spectral distributions through layers. Proves asymptotic freeness of the layerwise Jacobians of an MLP with Haar-distributed orthogonal weight initialization, using an invariance of the MLP.
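A minimal sketch of the objects involved (assumed tanh MLP, not the paper's proof): the layerwise Jacobians are D_l Q_l with Haar-orthogonal Q_l and diagonal activation derivatives D_l, and the input-output Jacobian is their product; asymptotic freeness is what lets the spectrum of the product be computed from the layerwise spectra.

```python
# Minimal sketch (assumed architecture): accumulate layerwise Jacobians of an
# MLP with Haar-orthogonal weights and inspect the singular values of the
# resulting input-output Jacobian.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 500, 5

def haar_orthogonal(n):
    # QR of a Gaussian matrix with sign correction gives a Haar orthogonal matrix
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

x = rng.normal(size=n)
J = np.eye(n)
for _ in range(depth):
    Q = haar_orthogonal(n)
    pre = Q @ x
    D = np.diag(1 - np.tanh(pre) ** 2)   # derivative of tanh at the preactivations
    J = D @ Q @ J                        # accumulate layerwise Jacobians
    x = np.tanh(pre)

sv = np.linalg.svd(J, compute_uv=False)
print("largest / smallest singular value of the input-output Jacobian:",
      sv.max(), sv.min())
```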
https://www.arxiv.org/abs/2203.06176
Find that the classical GCV estimator accurately predicts generalization risk even in overparameterized settings, and prove that the GCV estimator converges to the generalization risk whenever a random matrix law holds. Apply this theory to explain why pretrained representations generalize better and which factors govern scaling laws for kernel regression.
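For concreteness, a minimal sketch of the classical GCV estimator in an overparameterized ridge regression (assumed Gaussian design, not the paper's general setting), compared against a held-out test error:

```python
# Minimal sketch (assumed linear-Gaussian model): GCV(lambda) =
# mean((I - H) y)^2 / (1 - tr(H)/n)^2 with ridge hat matrix H, versus test error.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, sigma = 200, 300, 0.5, 0.5          # overparameterized: d > n

beta = rng.normal(size=d)
X = rng.normal(size=(n, d)) / np.sqrt(d)
X_test = rng.normal(size=(5000, d)) / np.sqrt(d)
y = X @ beta + sigma * rng.normal(size=n)
y_test = X_test @ beta + sigma * rng.normal(size=5000)

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)    # ridge hat matrix
gcv = np.mean((y - H @ y) ** 2) / (1.0 - np.trace(H) / n) ** 2

beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
test_err = np.mean((y_test - X_test @ beta_hat) ** 2)
print("GCV estimate: %.3f   held-out test error: %.3f" % (gcv, test_err))
```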
https://www.arxiv.org/abs/2111.01875
Provide an analytical framework that allows adopting standard initialization strategies, avoiding lazy training, and training all layers simultaneously in a basic shallow neural network while attaining a desirable subquadratic scaling in the network width, using the Polyak-Lojasiewicz condition and random matrix theory.
https://www.arxiv.org/abs/2111.13331
Analyze the evolution of weight matrices' spectra during training and classify them into three types: Marchenko-Pastur, Marchenko-Pastur with a few bleeding-out outliers, and heavy-tailed spectra. These classes are connected to the degree of regularization, and argue that this degree depends on the quality of the data.
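A minimal sketch of the baseline against which such spectra are classified (random weights only, no training loop): the empirical spectrum of an i.i.d.-initialized weight matrix sits inside the Marchenko-Pastur bulk; eigenvalues escaping the bulk edge during training are the bleeding-out outliers, and a power-law tail signals the heavy-tailed regime.

```python
# Minimal sketch (i.i.d. initialization only): empirical spectrum of W W^T
# compared with the Marchenko-Pastur bulk edges for aspect ratio q = n_out/n_in.
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 512, 1024
W = rng.normal(0, 1 / np.sqrt(n_in), (n_out, n_in))   # variance-1/n_in initialization

esd = np.linalg.eigvalsh(W @ W.T)                     # empirical spectral distribution
q = n_out / n_in
mp_lower, mp_upper = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
print("ESD support:           [%.3f, %.3f]" % (esd.min(), esd.max()))
print("Marchenko-Pastur bulk: [%.3f, %.3f]" % (mp_lower, mp_upper))
```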
https://www.arxiv.org/abs/2302.00401
Prove Gaussian universality of the test error of a fully connected network in which only the readout layer is trainable, in a ridge regression setting. This requires proving a deterministic equivalent for traces of the deep random features sample covariance matrix.