Random Matrix Theory

Products of Many Large Random Matrices and Gradients in Deep Neural Networks

https://www.arxiv.org/abs/1812.05994

For products of random matrices in which both the number of factors and the size of the matrices tend to infinity, show that the logarithm of the L2 norm of such a product applied to any fixed vector is asymptotically Gaussian.
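
A quick numerical sanity check of this claim (a minimal sketch, not code from the paper; the matrix size, number of factors, and i.i.d. Gaussian entries of variance 1/n are illustrative assumptions):

```python
# Sample many products W_L ... W_1 v of i.i.d. Gaussian matrices and check that
# log ||W_L ... W_1 v|| is approximately Gaussian across independent trials.
import numpy as np

rng = np.random.default_rng(0)
n, L, trials = 100, 50, 2000              # matrix size, number of factors, Monte Carlo trials
v = np.ones(n) / np.sqrt(n)               # any fixed unit vector

log_norms = np.empty(trials)
for t in range(trials):
    x = v.copy()
    for _ in range(L):
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))  # entries of variance 1/n
        x = W @ x
    log_norms[t] = np.log(np.linalg.norm(x))

# Near-zero skewness and excess kurtosis are consistent with asymptotic Gaussianity.
z = (log_norms - log_norms.mean()) / log_norms.std()
print("skewness:", np.mean(z**3), " excess kurtosis:", np.mean(z**4) - 3)
```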

Nonlinear random matrix theory for deep learning

https://proceedings.neurips.cc/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf

Show that pointwise nonlinearities can be incorporated into a standard method of proof in random matrix theory known as the moments method.

Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron: The Haar Orthogonal Case

https://www.arxiv.org/abs/2103.13466

A DNN's Jacobians with respect to both parameters and inputs are polynomials of its layerwise Jacobians, so asymptotic freeness is crucial for propagating spectral distributions through layers. Prove asymptotic freeness of the layerwise Jacobians of a multilayer perceptron with Haar-distributed orthogonal weight initialization, using an invariance of the MLP.

A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

https://www.arxiv.org/abs/1912.00827

Analyze the performance of a simple regression model trained on random features f(WX + B) for a random weight matrix W and random bias vector B, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. Find that a mixture of nonlinearities can outperform the best single nonlinearity on this task.
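
A hedged sketch of this kind of setup (the dimensions, noise level, ridge penalty, and the particular ReLU/tanh mixture below are illustrative choices, not the paper's exact model or formula):

```python
# Ridge regression of clean data on random features f(WX + B) of a noisy copy of the
# data (a noisy autoencoding task), for single nonlinearities and for a mixture.
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 50, 200, 400                    # input dim, random features, samples
X = rng.normal(size=(d, n))               # clean data
X_noisy = X + 0.5 * rng.normal(size=(d, n))

W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))   # random weight matrix
B = rng.normal(size=(m, 1))                           # random bias vector

def train_error(f, ridge=1e-3):
    Y = f(W @ X_noisy + B)                            # m x n random features
    A = X @ Y.T @ np.linalg.inv(Y @ Y.T + ridge * np.eye(m))  # ridge readout
    return np.mean((A @ Y - X) ** 2)

relu = lambda z: np.maximum(z, 0.0)
mixture = lambda z: np.where(np.arange(m)[:, None] % 2 == 0, relu(z), np.tanh(z))

print("relu features    :", train_error(relu))
print("tanh features    :", train_error(np.tanh))
print("relu/tanh mixture:", train_error(mixture))
```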

Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training

https://www.arxiv.org/abs/2006.09092

Demonstrate that the extremal eigenvalues of the batch Hessian are larger in magnitude than those of the empirical Hessian, using spiked field-dependent random matrix theory. Derive analytical expressions for the maximal learning rates as a function of batch size.
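
For quadratic (least-squares) losses the Hessian is just the scaled data Gram matrix, so the batch-versus-full comparison can be checked directly; the following is an illustrative sketch under that simplification, not the paper's spiked-model analysis:

```python
# Compare the largest eigenvalue of the full-data ("empirical") Hessian of a
# least-squares problem with the largest eigenvalues of mini-batch Hessians.
import numpy as np

rng = np.random.default_rng(2)
d, n, batch = 100, 5000, 64
X = rng.normal(size=(n, d))

H_full = X.T @ X / n                        # Hessian of (1/2n) * ||Xw - y||^2
lam_full = np.linalg.eigvalsh(H_full)[-1]

lam_batch = [np.linalg.eigvalsh(X[idx].T @ X[idx] / batch)[-1]
             for idx in (rng.choice(n, size=batch, replace=False) for _ in range(200))]

print("largest eigenvalue, full Hessian :", lam_full)
print("largest eigenvalue, batch Hessian:", np.mean(lam_batch), "(mean over 200 batches)")
# A larger spectral edge for the batch Hessian implies a smaller stable step size,
# consistent with maximal learning rates that grow with batch size.
```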

Free Probability, Newton lilypads and Jacobians of neural networks

https://www.arxiv.org/abs/2111.00841

The spectral density of the Jacobian is crucial for analyzing robustness, and such Jacobians can be modeled using free multiplicative convolutions from free probability theory. Present a reliable and very fast method for computing the associated spectral densities, based on an adaptive Newton-Raphson scheme.

Appearance of Random Matrix Theory in Deep Learning

https://www.arxiv.org/abs/2102.06740

Propose a novel model for the true loss surfaces of neural networks which allows for Hessian spectral densities with rank degeneracy and outliers, and predicts a growing independence of loss gradients as a function of distance in weight-space.

The Emergence of Spectral Universality in Deep Networks

https://www.arxiv.org/abs/1802.09979

Using tools from free probability theory, develop a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters, including the nonlinearity, the weight and bias distributions, and the depth.
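
An empirical illustration of that dependence (the width, depth, gain, and the tanh/ReLU and Gaussian/orthogonal choices are assumptions for this sketch, not the paper's analytic results):

```python
# Singular values of the input-output Jacobian of a random deep network, for
# Gaussian vs. (scaled) orthogonal weights and two pointwise nonlinearities.
import numpy as np

rng = np.random.default_rng(3)
n, depth, gain = 300, 20, 1.05

def jacobian_svals(phi, dphi, orthogonal):
    x = rng.normal(size=n)
    J = np.eye(n)
    for _ in range(depth):
        if orthogonal:
            W = gain * np.linalg.qr(rng.normal(size=(n, n)))[0]
        else:
            W = rng.normal(0.0, gain / np.sqrt(n), size=(n, n))
        h = W @ x
        J = (dphi(h)[:, None] * W) @ J       # chain rule: diag(phi'(h)) @ W @ J
        x = phi(h)
    return np.linalg.svd(J, compute_uv=False)

acts = {"tanh": (np.tanh, lambda h: 1.0 - np.tanh(h) ** 2),
        "relu": (lambda h: np.maximum(h, 0.0), lambda h: (h > 0).astype(float))}

for name, (phi, dphi) in acts.items():
    for orth in (False, True):
        s = jacobian_svals(phi, dphi, orth)
        print(f"{name}, {'orthogonal' if orth else 'gaussian'} weights: "
              f"s_max={s.max():.2e}, s_min={s.min():.2e}")
```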

Analysis of One-Hidden-Layer Neural Networks via the Resolvent Method

https://www.arxiv.org/abs/2105.05115

Investigate the spectral density of the random feature matrix M = YY^* where Y = f(WX+B), extending previous results that lacked the bias term, and show that it is impossible to choose an activation function that preserves the layer-to-layer singular value distribution. Use the resolvent method together with the cumulant expansion, which is more robust and less combinatorial than the moment method.
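
A direct simulation of the matrix in question (the dimensions and the tanh activation are illustrative assumptions):

```python
# Empirical eigenvalue histogram of M = Y Y^* / n with Y = f(WX + B).
import numpy as np

rng = np.random.default_rng(4)
d, p, n = 400, 600, 1200                      # input dim, number of features, samples

X = rng.normal(size=(d, n))
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(p, d))
B = rng.normal(size=(p, 1))                   # bias, broadcast across samples
Y = np.tanh(W @ X + B)

M = Y @ Y.T / n
eigs = np.linalg.eigvalsh(M)

hist, edges = np.histogram(eigs, bins=30, density=True)
for left, h in zip(edges[:-1], hist):         # crude text histogram of the spectral density
    print(f"{left:6.2f} | " + "#" * int(40 * h))
```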

Node Feature Kernels Increase Graph Convolutional Network Robustness

https://www.arxiv.org/abs/2109.01785

Using random matrix theory on GCNs, show that if the graph is sufficiently random, the GCN fails to benefit from the node features; then propose a node feature kernel that solves this problem.

Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks

https://www.arxiv.org/abs/2111.15256

Study the Fisher information matrix (FIM) of a one-hidden-layer network and show that its eigenvalue distribution has three major clusters: the largest eigenvalue is a Perron-Frobenius-type eigenvalue, the eigenspace of the second cluster is spanned by the row vectors of the first-layer weight matrix, and the direct sum of the first eigenspace and the third cluster's eigenspace is spanned by Hadamard products of the first-layer weight rows.
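
A hedged Monte Carlo sketch of the object being decomposed (the architecture, Gaussian input distribution, and restriction to the first-layer weights are my assumptions, not necessarily the paper's exact setting):

```python
# Estimate the Fisher information matrix of f(x) = sum_i v_i * relu(w_i . x) with
# respect to the first-layer weights W, over Gaussian inputs, and inspect its spectrum.
import numpy as np

rng = np.random.default_rng(5)
d, m, samples = 20, 10, 20000                 # input dim, hidden units, Monte Carlo samples
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
v = rng.normal(size=m)

X = rng.normal(size=(samples, d))
act = (X @ W.T > 0).astype(float)             # ReLU derivative for each hidden unit

# Per-sample gradient with respect to W, flattened to an (m*d)-vector.
G = ((act * v)[:, :, None] * X[:, None, :]).reshape(samples, m * d)

FIM = G.T @ G / samples                       # Monte Carlo estimate of the FIM
eigs = np.linalg.eigvalsh(FIM)[::-1]
print("leading eigenvalues:", np.round(eigs[:12], 3))   # look for separated clusters
```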

A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double Descent

https://www.arxiv.org/abs/2006.05013

Derive exact asymptotics of random Fourier feature (RFF) regression, showing that the RFF Gram matrix does not converge to the well-known Gaussian kernel matrix but still has tractable behavior, with accurate estimates of the regression error.
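
A small comparison sketch (the data distribution, bandwidth, and number of features are illustrative; the paper's point concerns the proportional regime where the sample size, dimension, and number of features grow together):

```python
# Random Fourier feature Gram matrix vs. the Gaussian kernel matrix it approximates.
import numpy as np

rng = np.random.default_rng(6)
d, n, D, sigma = 30, 200, 250, 2.0            # input dim, samples, RFF dimension, bandwidth

X = rng.normal(size=(n, d))
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=(D, 1))
Z = np.sqrt(2.0 / D) * np.cos(W @ X.T + b)    # D x n random Fourier features

K_rff = Z.T @ Z                               # RFF Gram matrix
sq = np.sum(X**2, axis=1)
K_gauss = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))

print("max entrywise gap       :", np.max(np.abs(K_rff - K_gauss)))
print("top eigenvalues (RFF)   :", np.round(np.linalg.eigvalsh(K_rff)[-3:], 3))
print("top eigenvalues (kernel):", np.round(np.linalg.eigvalsh(K_gauss)[-3:], 3))
```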

Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case

https://www.arxiv.org/abs/2201.04543

Study an untrained network's input-output Jacobian in the infinite-width limit, using techniques different from those of previous work, and justify that the singular value distribution of the Jacobian coincides with that of an analog of the Jacobian built from special random but weight-independent diagonal matrices.

More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize

https://www.arxiv.org/abs/2203.06176

Find that the classical GCV (generalized cross-validation) estimator accurately predicts generalization risk even in overparameterized settings, and prove that the GCV estimator converges to the generalization risk whenever a random matrix law holds. Apply this theory to explain why pretrained representations generalize better, as well as what factors govern scaling laws for kernel regression.
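
A minimal sketch of the GCV estimator itself (the Gaussian data model, dimensions, and ridge penalty are illustrative assumptions, not the paper's real-world representations):

```python
# Compare the generalized cross-validation (GCV) estimate for ridge regression with
# the actual test risk, in an overparameterized setting (d > n).
import numpy as np

rng = np.random.default_rng(7)
d, n, n_test, lam = 300, 200, 2000, 1e-1
w_star = rng.normal(size=d) / np.sqrt(d)

def sample(k):
    X = rng.normal(size=(k, d))
    return X, X @ w_star + 0.1 * rng.normal(size=k)

X, y = sample(n)
X_te, y_te = sample(n_test)

A = X.T @ X + n * lam * np.eye(d)
S = X @ np.linalg.solve(A, X.T)                       # ridge "hat" matrix
gcv = np.mean((y - S @ y) ** 2) / np.mean(1.0 - np.diag(S)) ** 2

w_hat = np.linalg.solve(A, X.T @ y)
test_risk = np.mean((X_te @ w_hat - y_te) ** 2)
print("GCV estimate:", gcv, "   test risk:", test_risk)
```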

Subquadratic Overparameterization for Shallow Neural Networks

https://www.arxiv.org/abs/2111.01875

Provide an analytical framework that makes it possible to adopt standard initialization strategies, avoid lazy training, and train all layers simultaneously in basic shallow neural networks, while attaining a desirable subquadratic scaling of the network width, using the Polyak-Lojasiewicz condition and random matrix theory.

Implicit Data-Driven Regularization in Deep Neural Networks under SGD

https://www.arxiv.org/abs/2111.13331

Analyze the evolution of the weight matrices' spectra during training and classify them into three types: Marchenko-Pastur, Marchenko-Pastur with a few bleeding-out outliers, and heavy-tailed. Connect these types to the degree of implicit regularization and argue that this degree depends on the quality of the data.
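
As a baseline for this classification, the spectrum of an i.i.d. (untrained) weight matrix can be compared against the Marchenko-Pastur law directly; a sketch with illustrative sizes (trained networks are where the outliers and heavy tails would appear):

```python
# Empirical eigenvalue histogram of W W^T / m for an i.i.d. Gaussian weight matrix,
# next to the Marchenko-Pastur density with the same aspect ratio q = n / m.
import numpy as np

rng = np.random.default_rng(8)
n, m = 500, 1500
q = n / m

W = rng.normal(size=(n, m))
eigs = np.linalg.eigvalsh(W @ W.T / m)

lam_minus, lam_plus = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
def mp_density(x):
    return np.sqrt(np.maximum((lam_plus - x) * (x - lam_minus), 0.0)) / (2 * np.pi * q * x)

hist, edges = np.histogram(eigs, bins=25, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centers, hist):
    print(f"x={c:5.2f}   empirical={h:5.2f}   MP={mp_density(c):5.2f}")
# Eigenvalues bleeding past lam_plus, or a heavy right tail, would place the matrix in
# the other two categories.
```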

Deterministic equivalent and error universality of deep random features learning

https://www.arxiv.org/abs/2302.00401

Prove Gaussian universality of the test error of a deep fully connected network in which only the readout layer is trainable, in a ridge regression setting. This requires proving a deterministic equivalent for traces of the deep random features sample covariance matrix.
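
A direct simulation of the random quantity the summary mentions (widths, depth, and the tanh activation are illustrative; the paper supplies the deterministic equivalent, this only draws one realization):

```python
# Normalized trace of the resolvent of the deep random features sample covariance
# matrix, i.e. of Sigma = H^T H / n where H are the features after several frozen
# random layers and only the readout on H would be trained.
import numpy as np

rng = np.random.default_rng(9)
d, width, depth, n, z = 200, 300, 3, 400, -0.5   # resolvent evaluated at a point z < 0

Ws = [rng.normal(0.0, 1.0 / np.sqrt(d if l == 0 else width),
                 size=(width, d if l == 0 else width)) for l in range(depth)]

H = rng.normal(size=(n, d))                      # Gaussian input data
for W in Ws:
    H = np.tanh(H @ W.T)                         # frozen random layers

Sigma = H.T @ H / n                              # deep random features sample covariance
resolvent = np.linalg.inv(Sigma - z * np.eye(width))
print("normalized trace of the resolvent:", np.trace(resolvent) / width)
```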