Spectral Algorithms
Notes and Summary
Table of Contents
- 1. Chapter 1: Best Fit Subspace & k-variance problem
- 2. Chapter 2: Mixture Model
- 3. Chapter 3: Probabilistic Spectral Clustering
- 4. Chapter 4: Recursive Spectral Clustering
- 5. Chapter 5: Optimization via Low-Rank Approximation
- 6. Chapter 6: Matrix Approximation by Random Sampling
- 7. Chapter 7: Adaptive Sampling Methods
- 8. Chapter 8: Extensions of SVD
Book by Ravindran Kannan & Santosh Vempala [pdf]
Eigenvalues, eigenvectors, singular values, and singular vectors are spectral parameters. Any algorithm that uses spectral parameters to do something useful is called a spectral algorithm.
The first 5 chapters cover applications of spectral methods to different kinds of problems:
- Clustering (k-variance problem)
- Classifying samples from Mixture Models
  - For mixtures of Gaussians, SVD gives the current best guarantees
- Probabilistic Spectral Clustering
  - Convert the clustering problem to a graph problem: find the underlying generative random graph from a sample of that graph.
- Recursive Spectral Clustering (partition a weighted graph into \(k\) clusters)
- Combinatorial Optimization (MAX-rCSP problem)
The next 2 chapters cover randomized algorithms for computing the spectral quantities used in the previous chapters (e.g. finding the top \(k\) singular vectors to obtain the best rank-\(k\) approximation).
Finally, the last chapter covers extensions of spectral methods from matrices to tensors.
1. Chapter 1: Best Fit Subspace & k-variance problem
1.1. Best Fit Subspace
The span of the top-\(k\) singular vectors is the best-fit \(k\)-dimensional subspace for the rows of \(A\).
E.g. if we view the rows of a matrix as points in space, then the best-fit line for a given set of points is spanned by the top singular vector.
In this way, finding the best-fit k-subspace is a generalization of the linear least-squares problem.
1.1.1. "Best"
"Best" is interms of Frobenius norm. Or equivalently minimizing the perpendicular distance of original points to the subspace.
Theorem 1.4: Among all rank k matrices D, the matrix \(A_k = \sum_{i=1}^k \sigma_i u_i v_i^T\) is the one which minimizes \(||A-D||_F^2\)
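As a quick illustration (my own, not from the book), a minimal numpy sketch that forms \(A_k\) by truncating the SVD and checks that its Frobenius error is no larger than that of an arbitrary rank-\(k\) matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
k = 5

# Truncated SVD: A_k = sum_{i<=k} sigma_i * u_i * v_i^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]

# Any other rank-k matrix D (here a random one) has at least as much error
D = rng.standard_normal((100, k)) @ rng.standard_normal((k, 30))
assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - D, 'fro')
```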
SVD finds the best subspace in polynomial time, but for other measures of "best" (e.g. sum of distances, max distance) no polynomial-time algorithms are known. [Page 8]
1.2. k-variance problem
k-variance problem: Given a set of \(m\) points in \(\mathbb{R}^n\), find \(k\) points (the cluster centers) such that the sum of squared distances from each point to its nearest cluster center is minimized.
Relaxed Problem (Continuous Clustering Problem): Find a subspace \(V\) of \(\mathbb{R}^n\) of dimension at most \(k\) which minimizes the sum of squared distances from the points to the subspace. The top \(k\) right singular vectors of \(A\) span the optimal subspace; call it \(V_k\).
We can show that by first projecting the points of \(A\) onto \(V_k\) and then solving the k-variance problem for the projected points in the lower-dimensional space, we get a solution that is at most 2 times worse than the best solution. Since \(k\) is smaller than the original dimension of the problem (though the number of points stays the same), it is computationally cheaper to solve the problem in \(k\) dimensions.
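A minimal sketch of this project-then-cluster approach; the use of scikit-learn's KMeans for the low-dimensional step is my own choice, not the book's:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; any k-means solver works

def kvariance_via_svd(A, k, seed=0):
    """Project the rows of A onto the best-fit k-subspace (top-k right
    singular vectors), then run k-means on the k-dimensional coordinates.
    By the argument above, solving k-variance optimally on these coordinates
    costs at most twice the optimum; KMeans is a heuristic stand-in."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    coords = A @ Vt[:k].T  # m x k coordinates of the projected points
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)
```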
2. Chapter 2: Mixture Model
The problem is to classify points sampled from a mixture model with \(k\) component distributions. In this chapter the component distributions are Gaussian.
If the means of the component distributions are separated by a small number of standard deviations, then projecting the distribution onto the subspace spanned by the means gives a well-separated mixture, and the dimension is now at most \(k\). [Lemma 2.1]
Classification may then be done by a distance-based method: compute the pairwise distances between points and group together all points whose distance is less than some threshold. This threshold is chosen so as to form \(k\) components. [Page 19]
2.1. General Algorithm
The general approach is as follows:
- Compute the SVD of the sample matrix
- Project the points into the best fit \(k\) subspace
- Perform classification in that subspace
Since the \(k\)-dimensional subspace is much lower dimensional than the original space, the classification step (e.g. distance-based classification) is computationally cheaper there.
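A rough sketch of this pipeline (my own illustration; the threshold and the use of connected components are assumptions, not the book's exact procedure):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components  # assumed available

def classify_mixture(A, k, threshold):
    """Project rows of A onto the best-fit k-subspace, link any two points
    closer than `threshold`, and take connected components as clusters.
    In practice the threshold is tuned so that exactly k components appear."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = A @ Vt[:k].T                                        # m x k projected points
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    n_comp, labels = connected_components(csr_matrix(D < threshold))
    return n_comp, labels
```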
2.2. Some facts about Spherical Gaussians
- The best-fit 1-dimensional subspace for a spherical Gaussian distribution is a line that passes through the mean of that Gaussian.
- Due to symmetry, the best-fit subspace of dimension 2 or more is any subspace containing the mean.
- The best-fit \(k\)-subspace for a mixture of \(k\) spherical Gaussians is one that contains the means of the Gaussians.
So for a mixture of spherical Gaussians, the projection onto the best-fit \(k\)-subspace is the same as the projection onto the subspace spanned by the means, and thus the projection results in a well-separated mixture.
2.3. General Distribution
For non-spherical Gaussians, and for general distributions, the best-fit k-subspace need not pass through the means (e.g. the parallel pancakes problem [Section 2.5]). This property only holds for mixtures of weakly isotropic distributions [Exercise 2.3]. So, for general distributions, the subspace that maximizes the squared projection is not the best subspace for classification.
Still, even for mixtures of general distributions, the means of the components don't move too much after projection onto the SVD subspace [Theorem 2.6; Pg. 23], meaning the SVD subspace is "close" to the subspace spanned by the component means, and the general algorithm is still useful.
However, this still doesn't solve the parallel pancakes problem, as the means may overlap after projection.
2.4. SVD of Samples
The SVD subspace of a Gaussian mixture is useful for classification. However, we don't have the distributions, only samples from the mixture distribution. So the SVD is useful if:
- the SVD of the sample matrix is not far from the SVD of the mixture distribution
Also, to use classification based on distances, we require that the pairwise distances between samples from the same component are not too large.
These properties are satisfied by log-concave distributions. And since most commonly seen distributions are log-concave, this is useful.
2.5. Affine Invariant Algorithm
For the above general algorithm, an affine transformation could change a well separated problem into one that is not well-separated. So we need an affine invariant algorithm:
- Use the sample to find an affine transformation that makes the distribution nearly isotropic, i.e.
  - shift the center of the distribution to the origin
  - scale/rotate, i.e. transform so as to make the covariance matrix the identity
- Reweight using a spherical Gaussian
- Find the inter-mean direction and project along that direction
- Classify the projected points based on distances
This algorithm only needs the mixture to be hyperplane separable.
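A minimal sketch of the isotropy step alone (centering and whitening so the empirical covariance becomes the identity); the reweighting and inter-mean projection steps are omitted:

```python
import numpy as np

def make_isotropic(X):
    """Affine-transform the samples X (rows are points) so that the empirical
    mean is 0 and the empirical covariance is the identity.  Assumes the
    empirical covariance is nonsingular."""
    Xc = X - X.mean(axis=0)                              # move the center to the origin
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T  # cov^{-1/2}
    return Xc @ W
```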
I didn't understand the proof of the algorithm, and thus I am not sure how it works. But somehow,
- Making the distribution of samples isotropic makes the subspace spanned by the means of the components the same as the Fisher subspace.
- In the Fisher subspace, the overlap of the components is small in every direction.
- The rest of the algorithm (reweighting, finding the inter-mean direction) finds directions close to this Fisher subspace (or the subspace spanned by the means)
- After finding the subspace, classification can be done based on distances.
2.6. Benefits
- Classic algorithms for classifying mixture models (e.g. EM or local-search heuristics) can get stuck in local minima or take a long time to converge.
- This line of algorithms provide rigorous guarantees
3. Chapter 3: Probabilistic Spectral Clustering
The problem is to find the underlying partition of the vertices of a random graph, where an edge between vertices in two parts (\(r\) and \(s\)) of the partition appears with probability \(p_{rs}\).
We are given a realization of that random graph, as an adjacency matrix \(A\), and we need to find the generative model \(\mathbb{E} A\).
This problem can also be viewed as a mixture model, but points from even the same component distribution are far further apart than in a Gaussian mixture model, so the nature of the problem is different. However, spectral methods can still be applied [Page 31].
3.1. Full Independence
We assume Full Independence: the edges of the graph are mutually independent random variables.
Then we can use random matrix theory to bound \(||A - \mathbb{E}A||\). Finally, we can argue that \(A_k\), the best rank-\(k\) approximation to \(A\), is in fact close to \(\mathbb{E}A\) in spectral norm, and use this to cluster "most" points correctly.
3.1.1. Details
Due to full independence, the matrix \(A - \mathbb{E} A\) has random independent entries with mean zero. And we can utilize a theorem (by Wigner, Furedi, Komlos, Theorem 3.1):
\begin{equation*} ||A - \mathbb{E} A||_2 \leq c \nu \sqrt{n} \end{equation*}
where \(\nu < 1\) is the largest standard deviation of the entries and \(A\) is a symmetric random matrix
This bound tells us that the rows are not strongly correlated: a large top eigenvalue (\(= \max_{|x|=1}||(A-\mathbb{E}A)x||\)) would imply strong correlation in some direction \(x\), but the top eigenvalue is only within some constant \(c\) of the norm of a typical row, \(O(\nu\sqrt{n})\).
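A quick numerical sanity check of this bound (my own, not from the book), using a symmetric 0/1 adjacency matrix with edge probability \(p\), so \(\nu = \sqrt{p(1-p)}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 0.1
nu = np.sqrt(p * (1 - p))                  # std of a Bernoulli(p) entry

# Symmetric 0/1 adjacency matrix with independent upper-triangular entries
upper = np.triu((rng.random((n, n)) < p).astype(float), 1)
A = upper + upper.T
EA = p * (np.ones((n, n)) - np.eye(n))     # expected adjacency (no self-loops)

dev = np.linalg.norm(A - EA, 2)            # spectral norm of the deviation
print(dev / (nu * np.sqrt(n)))             # a small constant (around 2) in practice
```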
Spectral norm and Frobenius norm are related by the relation: \(|| A_k - B ||_F^2 \leq 5k ||A - B ||_2^2\) when \(B\) has rank \(k\)
So,
\begin{equation*} ||A_k - \mathbb{E} A||_F^2 \leq c \nu^2 n k \end{equation*}
We now have a bound on the Frobenius norm of \(A_k - \mathbb{E}A\). Since there are \(n\) rows, this means that on average (and, with high probability, for most rows) the squared norm of each row is also bounded:
\(|(A_k)_i - (\mathbb{E} A)_i|^2 \le c \nu^2 k\)
i.e. \(A_k\), obtained from the SVD of the adjacency matrix \(A\), approximates the rows of the underlying generative matrix \(\mathbb{E}A\).
3.2. Basic Algorithm
If we assume a separability condition which says that the means of the component distributions are separated by some amount \(20 c \nu k\) (independent of \(n\)), then we have an algorithm:
- Find the top \(k\) singular vectors of \(A\)
- Project \(A\) to the space of the best fit \(k\) -subspace
- Pick \(k\) random points (our cluster centers) in some way such that they are far away from each other
- Now assign points to the clusters if they are within some distance \(4c\nu k\).
- Clean up wrongly classified vertices (this step is involved and requires some assumptions)
It is an open problem to give a simple, optimal clean-up algorithm for probabilistic spectral clustering.
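A rough sketch of the projection, center-picking, and assignment steps (the radius is a placeholder rather than the book's \(4c\nu k\), and the clean-up step is omitted):

```python
import numpy as np

def basic_spectral_cluster(A, k, radius):
    """Project the rows of A onto the top-k singular subspace, greedily pick k
    centers that are pairwise far apart, and assign each point to its nearest
    center if it lies within `radius` (clean-up step omitted)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = A @ Vt[:k].T                         # k-dimensional projected points
    centers = [0]                            # greedy farthest-point center picking
    for _ in range(k - 1):
        d = np.linalg.norm(P[:, None] - P[centers][None], axis=-1).min(axis=1)
        centers.append(int(np.argmax(d)))
    dist = np.linalg.norm(P[:, None] - P[centers][None], axis=-1)   # m x k
    labels = np.where(dist.min(axis=1) <= radius, dist.argmin(axis=1), -1)
    return labels                            # -1 marks unassigned points
```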
3.3. Deterministic Assumptions
Under some assumptions: [Section 3.2]
- Boundedness: The points \(A_i\) are not too far from cluster center \(\mu_r\). i.e. \(|A_i - \mu_r| < M\)
- Correct center closest: For each point, the correct center is the closest center.
- No small clusters: the size of each cluster (i.e. the number of vertices in cluster \(T_r\)) is on the order of the total number of vertices: \(|T_r| \geq m_0 \in \Omega(n)\)
Use an approximation algorithm to find cluster centers \(v_1, v_2, ..., v_k\); then [Section 3.2.1]
\begin{equation*} S_r = \{i: \forall s \ |(A_k)_i - v_r| \leq | (A_k)_i - v_s| \} \end{equation*}
is the exact set of vertices of cluster \(r\).
4. Chapter 4: Recursive Spectral Clustering
The problem is to partition a graph into clusters. Compared to the previous chapter, where the problem was discrete (i.e. the graph was unweighted), we now have weighted graphs. The key step is an approximation algorithm for finding a cut that gives minimum "conductance" for the partitioned subsets of the vertices.
If a cut separates the graph into two subsets of vertices \(S\) and \(S' = V \setminus S\), the conductance of \(S\) is the ratio of two quantities:
- the total weight of edges going out of the subset \(S\)
- the smaller of the total edge weights incident to \(S\) and to \(S'\)
So a good minimum-conductance cut keeps each subset fairly well connected while the cut crosses as small a total weight of edges as possible.
4.1. Spectral Heuristic for approx min conductance cut
The following algorithm gives an approximation to the minimum-conductance cut, using a heuristic based on the second-largest eigenvector of the (row-normalized) adjacency matrix:
- Normalize adjacency matrix so that each row sum is 1
- Find the second largest eigenvector
- Order the vertices according to their components in this eigenvector
- Find the minimum conductance cut among the cuts given by this ordering
[Note: this is different from using the Fiedler vector, which is the eigenvector of the second-smallest eigenvalue of the Laplacian]
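A rough sketch of this heuristic, assuming a symmetric weighted adjacency matrix with positive row sums; the eigenvector is computed via the symmetric normalization \(D^{-1/2}AD^{-1/2}\) and mapped back, which is an implementation choice rather than the book's exact presentation:

```python
import numpy as np

def spectral_sweep_cut(A):
    """A: symmetric weighted adjacency matrix (positive row sums).
    Returns (S, phi): vertex indices of the best sweep cut and its conductance."""
    d = A.sum(axis=1)
    dis = 1.0 / np.sqrt(d)
    N = dis[:, None] * A * dis[None, :]      # D^{-1/2} A D^{-1/2} (symmetric)
    _, evecs = np.linalg.eigh(N)
    v = dis * evecs[:, -2]                   # 2nd-largest eigenvector of D^{-1} A
    order = np.argsort(-v)                   # order vertices by their component
    total, best_phi, best_S = d.sum(), np.inf, None
    for i in range(1, len(order)):
        S, rest = order[:i], order[i:]
        cut = A[np.ix_(S, rest)].sum()       # weight crossing the cut
        phi = cut / min(d[S].sum(), total - d[S].sum())
        if phi < best_phi:
            best_phi, best_S = phi, S
    return best_S, best_phi
```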
4.2. Recursive Clustering
Given an algorithm \(\mathcal{A}\) to find an approximate minimum conductance cut of the graph \(G\)
- Find a cut that approximates the minimum conductance cut in \(G\)
- If the conductance of the cut obtained is below a threshold, recurse on the pieces induced by the cut
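A sketch of the recursion, reusing the `spectral_sweep_cut` function from the previous section as the approximate min-conductance subroutine; the threshold `phi_max` and the stopping rule are illustrative choices:

```python
import numpy as np

def recursive_cluster(A, vertices, phi_max):
    """Recursively split the subgraph of A induced by `vertices` as long as a
    cut of conductance below phi_max is found; returns a list of clusters."""
    if len(vertices) <= 1:
        return [vertices]
    sub = A[np.ix_(vertices, vertices)]
    S, phi = spectral_sweep_cut(sub)         # approximate min-conductance cut
    if phi >= phi_max:                       # piece is well connected: stop
        return [vertices]
    mask = np.zeros(len(vertices), dtype=bool)
    mask[S] = True
    return (recursive_cluster(A, vertices[mask], phi_max)
            + recursive_cluster(A, vertices[~mask], phi_max))

# e.g. clusters = recursive_cluster(A, np.arange(A.shape[0]), phi_max=0.2)
```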
There is a theorem [Theorem 4.3] that provides guarantees on the final quality of the clustering, given an approximation algorithm. The quality of the clustering is measured by two parameters:
- \(\alpha\): The conductance of each cluster is at least \(\alpha\)
- \(\epsilon\): The total weight of inter-cluster edges is at most an \(\epsilon\) fraction of the total weight of all edges
4.3. Recursive Spectral Clustering
Using the Recursive clustering algorithm with Spectral heuristic as the approximate algorithm we get a spectral clustering algorithm.
Using Theorem 4.3 along with the properties of the spectral heuristic cut, we can derive a formula for the quality of the clustering that this algorithm will find [see Corollary 4.6 for the exact formula].
5. Chapter 5: Optimization via Low-Rank Approximation
(For the most part, I didn't understand this chapter or the corresponding Chapter 8; the conditions under which the algorithms work were presented too densely.)
The problem is to solve a boolean constraint satisfaction problem with \(r\) variables per constraint. A generalization of this problem is the weighted MAX-rCSP problem, where the objective is to maximize the weighted sum of the satisfied constraints.
This problem can be represented as a tensor \(A\) with \(r\) dimensions, each with \(2n\) components, i.e. as a \(2n \times ... \times 2n = (2n)^r\) multi-dimensional array.
Two ideas are important here:
- Any \(r\)-dimensional array \(A\) can be approximated as a sum of a small number of rank-1 tensors.
- Such approximation can be found.
There are theorems that guarantee that a polynomial-time approximation scheme exists for any core-dense weighted MAX-rCSP,
where the core-dense condition is defined in Section 5.1.
When these conditions are satisfied, the algorithm takes the following steps:
- Scale the tensor A: \(B = D^{-1} A D^{-1}\)
- Find low rank approximation to the scaled tensor \(\hat{B} \approx B\)
- Scale the approximation \(\hat{A} = D \hat{B} D\)
- Solve MAX-rCSP for \(\hat{A}\)
But since \(\hat{A}\) has low-rank (effectively low-dimensional) structure, it is easier to solve the problem there.
6. Chapter 6: Matrix Approximation by Random Sampling
6.1. Low Rank Approximation
[From book RandNLA - Section 4 [arXiv]]
The task is to find a low-rank matrix \(\hat{A}\) that approximates the target matrix \(A\). The Frobenius norm, the spectral norm (2-norm), or other norms may be used.
For practical applications we find the low-rank matrix \(\hat{A}\) in a suitably factored representation. We can impose structural constraints on such a factored representation (e.g. orthogonal, diagonal, or a submatrix of \(A\)). Depending on these constraints, factored representations can be roughly classified into two types:
- Spectral Decomposition (low-rank SVD or eigendecomposition)
  \(A \approx \hat{A} = U \Sigma V^T\)
- Submatrix-Oriented Decomposition
  - CUR decomposition
6.2. Matrix-Vector Product
A matrix-vector product can be seen as a sum of \(n\) vectors:
\begin{equation*} A v = \sum_{j=1}^n v_j A^{(j)} \end{equation*}
where \(A^{(j)}\) is the \(j\)-th column of \(A\). So \(Av\) is a weighted sum of the columns of \(A\).
This sum can be approximated by sampling the columns of \(A\). If the \(j\)-th column is sampled with probability \(p_j\), then an unbiased estimate of \(Av\) is
\begin{equation*} X = \frac {A^{(j)} v_j} {p_j} \end{equation*}
i.e. \(\mathbb{E}\, X = Av\)
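A small numerical sketch of this estimator (my own; it already uses the length-squared probabilities introduced in the next paragraph, but any distribution with nonzero \(p_j\) gives an unbiased estimate):

```python
import numpy as np

def sampled_matvec(A, v, p, s, rng):
    """Average s independent copies of X = A^{(j)} v_j / p_j, where the column
    index j is drawn with probability p[j]; an unbiased estimate of A @ v."""
    idx = rng.choice(A.shape[1], size=s, p=p)
    X = A[:, idx] * (v[idx] / p[idx])        # each column is one sample of X
    return X.mean(axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200)); v = rng.standard_normal(200)
p = np.sum(A**2, axis=0) / np.sum(A**2)      # length-squared probabilities (see below)
print(np.linalg.norm(sampled_matvec(A, v, p, 50, rng) - A @ v) / np.linalg.norm(A @ v))
```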
Using the length-squared distribution (\(LS_{col}(A)\), i.e. \(p_j = ||A^{(j)}||^2 / ||A||_F^2\)) to sample the columns, the variance of the estimate \(X\) of \(Av\) is
\begin{equation*} \textrm{Var}(X) \leq ||A||_F^2 ||v||^2 \end{equation*}
Similarly, we might sometimes use an approximate length-squared distribution \(LS_{col}(A, c)\), where \(p_j \geq c\, ||A^{(j)}||^2 / ||A||_F^2\) for some \(c \in (0, 1]\). This gives a similar variance bound:
\begin{equation*} \textrm{Var}(X) \leq \frac 1 c ||A||_F^2 ||v||^2 \end{equation*}
We can reduce the variance by averaging \(s\) independent identically distributed samples:
\begin{equation*} \textrm{Var}\left(\frac 1 s \sum X\right) \leq \frac 1 {cs} ||A||_F^2 ||v||^2 \end{equation*}
6.3. Matrix-Matrix Product
Matrix Matrix product can also be similarly approximated by sampling.
Let \(A\) be an \(m \times n\) matrix and \(C\) an \(m \times s\) matrix formed by sampling \(s\) columns of \(A\) and scaling each column appropriately using the sampling probabilities \(p_j\) and the sample size \(s\). Then
\begin{equation*} \mathbb{E}\ CC^T = AA^T \end{equation*}
Also, \(CC^T\) and \(AA^T\) are close to each other [Theorem 6.4]:
\begin{equation*} \mathbb{E}\ ||CC^T - AA^T||_F^2 \leq \frac 1 {cs} ||A||_F^4 \end{equation*}
This implies that the singular values are also close:
\begin{equation*} \sum_t (\sigma_t(CC^T) - \sigma_t(AA^T))^2 \leq ||CC^T - AA^T||_F^2 \end{equation*}
The singular values of \(AA^T\) are just the squares of the singular values of \(A\).
So, using a sample of the columns of \(A\), we can estimate its singular values. We can also estimate the singular vectors; different algorithms do this in different ways.
6.4. Fast-SVD Algorithm
By sampling the columns of \(A\) and finding the singular vectors of the matrix formed from the sampled columns, we can find a rank-\(k\) approximation to \(A\):
- Sample \(s\) columns of \(A\) from length squared distribution to form \(C\)
- Find top \(k\) left singular vectors of \(C\): \(u^{(1)}, u^{(2)}, ..., u^{(k)}\)
- Output \(\sum_i^k u^{(i)}u^{(i)^T} A\) as rank \(k\) approximation to \(A\)
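A minimal numpy sketch of this algorithm; scaling each sampled column by \(1/\sqrt{s\,p_j}\) is the usual convention for length-squared sampling and is assumed here:

```python
import numpy as np

def fast_svd_approx(A, k, s, rng):
    """Rank-k approximation to A from a length-squared sample of s columns."""
    p = np.sum(A**2, axis=0) / np.sum(A**2)      # length-squared distribution
    idx = rng.choice(A.shape[1], size=s, p=p)
    C = A[:, idx] / np.sqrt(s * p[idx])          # scaled sampled columns (m x s)
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Uk = U[:, :k]                                # top-k left singular vectors of C
    return Uk @ (Uk.T @ A)                       # sum_i u^(i) u^(i)^T A

A = np.random.default_rng(0).standard_normal((300, 500))
A_tilde = fast_svd_approx(A, k=10, s=200, rng=np.random.default_rng(1))
```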
This algorithm finds a rank \(k\) approximation \(\tilde{A}\) such that:
\begin{equation*} \mathbb{E}\ (|| A - \tilde{A} ||_F^2) \leq || A - A_k ||_F^2 + 2 \sqrt{\frac k s} ||A||_F^2 \end{equation*}
6.5. Constant Time SVD Algorithm
By sampling the columns of \(A\) and then sampling rows of the matrix formed from the sampled columns, we get a square matrix of a fixed size (independent of the size of \(A\)); finding the singular vectors of that matrix then gives a rank-\(k\) approximation to \(A\).
- Pick a sample of \(s \propto \frac {k^5} {\epsilon^4}\) columns of \(A\) according to \(LS_{col}(A)\) distribution and scale to form \(m \times s\) matrix \(C\)
- Sample a set of \(s\) rows from \(C\) to form a \(s \times s\) matrix \(W\)
- Find the SVD of \(W^TW\):
- Compute
- Finally, return approximation to \(A\) as:
where \(l\) is given by some constraint on \(Cv^{(l)}\)
The approximation returned is accurate up to:
\begin{equation*} \mathbb{E}\ ||A - \hat{A}||_F^2 \leq \sum_{t=k+1}^n \sigma_t^2 (A) + \epsilon ||A||_F^2 \end{equation*}
6.6. CUR: Interpolative approximation
The approximation algorithms presented above do not preserve the sparsity of \(A\). But we can approximate \(A\) as \(A \approx CUR\), where
- \(C\) is an \(m \times s\) matrix formed by sampling \(s\) columns of \(A\),
- \(R\) is an \(s \times n\) matrix formed by sampling \(s\) rows of \(A\), and
- \(U\) is an \(s \times s\) matrix which can be computed from \(C\) and \(R\) [see Theorem 6.11 for the formula]
This representation has the benefit that
- sparsity of \(A\) is maintained
- computing \(CUR\) is fast and we need only two passes over \(A\)
6.7. Applications
- Streaming model for limited memory: sampling the rows or columns can be done even when the matrix rows/columns arrive one at a time, in any order.
- When the whole matrix \(A\) is unavailable (e.g. user survey data for a recommendation system), we can use the sample to represent the whole matrix \(A\).
7. Chapter 7: Adaptive Sampling Methods
The Fast-SVD and constant-time SVD algorithms produce a rank-\(k\) approximation within an additive error \(\epsilon||A||_F^2\), which is not so great because \(\epsilon||A||_F^2\) can be large and is out of our control.
However, that approximation is obtained with two passes over the data (the first to compute the distribution and the second to sample columns). If we can make multiple passes over the data and sample more strategically, we can obtain better error bounds which are multiplicative instead of additive, i.e. within \((1+\epsilon) ||A - A_k||_F^2\).
7.1. Iterative Fast SVD
We find a rank-\(k\) subspace for \(A\) iteratively; after each iteration we sample new rows from a distribution with probability proportional to the squared distance from the previous subspace.
Let \(\pi_{S}(A)\) be the matrix whose rows are the projections of the rows of \(A\) onto the span of the rows in \(S\). Then repeat \(t\) times:
- \(E = A - \pi_{S}(A)\) (where for the first iteration take \(E = A\))
- Sample \(s\) rows according to the distribution that assigns probabilities
\begin{equation*} p_j = \frac{||E_{(j)}||^2}{||E||_F^2} \end{equation*}
  (where \(E_{(j)}\) denotes the \(j\)-th row of \(E\))
- Add those \(s\) rows to the collection of sampled rows \(S\)
Finally, return the top \(k\) right singular vectors of \(\pi_S(A)\)
This algorithm makes \(2t\) passes over the data and computes a rank-\(k\) approximation within an additive error of \(\epsilon^t ||A||_F^2\), which is better than the additive error of Fast-SVD.
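A rough numpy sketch of this iteration (my own; the per-round sample size and the use of QR to project onto the accumulated row span are implementation choices):

```python
import numpy as np

def adaptive_sample_svd(A, k, s, t, rng):
    """t rounds of adaptive row sampling: each round samples s rows of A with
    probability proportional to their squared distance from the span of the
    rows sampled so far, then returns the top-k right singular vectors of the
    projection of A onto that span."""
    S_rows = np.empty((0, A.shape[1]))
    for _ in range(t):
        if S_rows.shape[0] == 0:
            E = A                                    # first round: E = A
        else:
            Q, _ = np.linalg.qr(S_rows.T)            # orthonormal basis of row span of S
            E = A - (A @ Q) @ Q.T                    # residual after projection
        p = np.sum(E**2, axis=1) / np.sum(E**2)
        idx = rng.choice(A.shape[0], size=s, p=p)
        S_rows = np.vstack([S_rows, A[idx]])
    Q, _ = np.linalg.qr(S_rows.T)
    _, _, Vt = np.linalg.svd((A @ Q) @ Q.T, full_matrices=False)
    return Vt[:k]
```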
7.2. Volume Sampling
Instead of using the length-squared distribution, if we sample a set \(S\) of \(k\) rows with probability proportional to the volume of the \(k\)-simplex \(\Delta(S)\) that they form together with the origin, then we get an approximation within a factor of \((k+1)\).
Let \(S\) be a sample of \(k\) rows, sampled with probability:
\begin{equation*} P_S = \frac{Vol(\Delta(S))} {\sum_{T: |T|=k} Vol(\Delta(T))} \end{equation*}
i.e. the volume of the simplex of \(S\), normalized by the sum of the volumes over all possible sets of \(k\) rows.
Then \(\tilde{A}_k\), the projection of \(A\) onto the span of the rows in \(S\), gives a rank-\(k\) approximation satisfying:
\begin{equation*} \mathbb{E}\ (|| A - \tilde{A}_k ||_F^2) \leq (k+1)|| A - A_k ||_F^2 \end{equation*}
However, sampling from the volume distribution is not straightforward, and it is still an open question whether exact volume sampling can be done in polynomial time (in \(n\) and \(k\)).
7.3. Isotropic Random Projection
- Take a random \(l \times m\) matrix \(S\) with independent zero-mean random entries (e.g. \(\pm 1\) with equal probability).
- Compute \(B = SA\), i.e. apply \(S\) to the columns of \(A\).
- Project \(A\) to the span of rows of \(B\) to get \(\tilde{A}\)
- Compute \(\tilde{A}_k\) the rank \(k\) approximation to \(\tilde{A}\)
This gives an approximation to \(A_k\) within a \((1 + \epsilon)\) factor:
\begin{equation*} \mathbb{E}\ (|| A - \tilde{A}_k ||_F^2) \leq (1+\epsilon)|| A - A_k ||_F^2 \end{equation*}
when \(l = Ck/\epsilon\).
[Note: The book says to take random matrix of size \(l \times n\) which doesn't make sense]
So projecting \(A\) to a lower-dimensional space using a random matrix gives us a "projection" of \(A\), the matrix \(\tilde{A}\), whose rank-\(k\) approximation is a good rank-\(k\) approximation for \(A\).
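A short numpy sketch of this procedure, using random \(\pm 1\) entries for \(S\) and \(l = Ck/\epsilon\) with an illustrative constant \(C\):

```python
import numpy as np

def random_projection_lowrank(A, k, eps, C=4, seed=0):
    """Rank-k approximation of A (m x n) via isotropic random projection:
    sketch B = S A with an l x m random sign matrix, project the rows of A
    onto the row span of B, and truncate to rank k."""
    rng = np.random.default_rng(seed)
    l = int(np.ceil(C * k / eps))
    S = rng.choice([-1.0, 1.0], size=(l, A.shape[0]))   # random sign matrix
    B = S @ A                                           # l x n sketch
    Q, _ = np.linalg.qr(B.T)                            # basis of the row span of B
    A_proj = (A @ Q) @ Q.T                              # projection of the rows of A
    U, s, Vt = np.linalg.svd(A_proj, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k]                    # rank-k truncation
```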
8. Chapter 8: Extensions of SVD
This chapter deals with the equivalent of SVD for tensors, i.e. representing a tensor as a sum of a small number of rank-1 tensors.
8.1. Tensor decomposition
For any tensor \(A\) there exist \(k\) rank-1 tensors \(B_1, B_2, ..., B_k\) such that
\begin{equation*} ||A - (B_1 + B_2 + ... + B_k)||_2 \leq \epsilon ||A||_F \end{equation*}
for some suitably defined \(\epsilon\) depending on \(k\).
The idea for finding the approximation is roughly (I didn't understand this part well):
- Sample along some lines of the tensor \(A\) with probability proportional to the square of the entries along those lines
- Use that sample to reduce the dimension of the tensor, and apply the algorithm recursively
- Then output the set of vectors sampled
8.2. Isotropic PCA
This section describes some way to do an equivalent of PCA on tensor.