Inaccuracy in randomized SVD at moderate ranks

Hi,

### *Randomized SVD is not accurate*

Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is first routed through either `TruncatedSVD` or `PCA(svd_solver='randomized')`.  It turns out that this solver can be *pretty bad* at computing even moderate rank SVDs. Consider this pathological example in which we create a 1000 x 500 matrix with `np.hstack([np.zeros(249,),np.arange(250,501)])` as its spectrum. The matrix is rank 250. We will also consider its rank-50 reconstruction and its rank 1 approximation.
```
import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
    sx = np.hstack([np.zeros(r-1,),np.arange(250,501)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(449,),np.arange(450,501)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed@pca_operator.components_ # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
```
```
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553
```
It is clear that there is a problem

It turns out that we can increase `k` and our estimate gets better

```
L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
```
```
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267

```

We can also decrease the rank of the underlying approximation to get better accuracy.  What is happening here is that `randomized_svd` gets more accurate when there are more singular vectors requested, proportional to the rank of the matrix. As `n_components` gets closer to (and larger than) to the rank of the matrix, the algorithm gets more accurate. Let's finally look at the extreme case and compare our rank 1 approximation. The task here is to only estimate a single singular pair.

```
L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
print(np.mean(L2_sv_error))
print(np.mean(F_lowrank_error))
```

```
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816
```
### *We can make the algorithm more accurate*
It turns out that there are a lot of edge cases and examples where randomized svd will fail either because the matrix is too large, ill-conditioned, the rank is too high, etc.  However, there are a few parameters that can be tweaked in the inner function of randomized svd, `sklearn.utils.extmath.randomized_svd` to make things more accurate. The biggest one is `n_oversamples`, and then `n_iters`

### *How to change graphtools*
I propose that we replace all calls to PCA and Truncated SVD with explicit calls to `randomized_svd` and we set sensible `n_oversamples` as a factor of the requested `n_pca`. The default is not very good. The sklearn documentation suggests for a rank `k` matrix `n_oversamples` should be `2*k-n_components` or just simply `n_components` when `n_components >= k`, but I have found for hard problems this is not enough. We can also add an `svd_kwargs` keyword argument to the graph constructors to allow passing through kwargs to randomized SVD to increase accuracy or trade accuracy for performance.

@scottgigante @dburkhardt 





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inaccuracy in randomized SVD at moderate ranks #58

Randomized SVD is not accurate

We can make the algorithm more accurate

How to change graphtools

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inaccuracy in randomized SVD at moderate ranks #58

Description

Randomized SVD is not accurate

We can make the algorithm more accurate

How to change graphtools

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions