Hi,
Randomized SVD is not accurate
Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is first routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be pretty bad at computing even moderate rank SVDs. Consider this pathological example in which we create a 1000 x 500 matrix with np.hstack([np.zeros(249,),np.arange(250,501)]) as its spectrum. The matrix is rank 250. We will also consider its rank-50 reconstruction and its rank 1 approximation.
import numpy as np
from sklearn.decomposition import TruncatedSVD
L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
x = np.random.randn(m,n)
ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
sx = np.hstack([np.zeros(r-1,),np.arange(250,501)])[::-1] # the ground truth singular values we will use
sx_lowrank = np.hstack([np.zeros(449,),np.arange(450,501)])[::-1] # the spectrum of the best rank-50 approximation
y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
pca_operator = TruncatedSVD(k)
pca_operator.fit(y)
sy = pca_operator.singular_values_ # our estimate of the singular values
y_transformed = pca_operator.transform(y)
y_reconstructed = y_transformed@pca_operator.components_ # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553
It is clear that there is a problem
It turns out that we can increase k and our estimate gets better
L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
x = np.random.randn(m,n)
ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
pca_operator = TruncatedSVD(k)
pca_operator.fit(y)
sy = pca_operator.singular_values_ # our estimate of the singular values
y_transformed = pca_operator.transform(y)
y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267
We can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that randomized_svd gets more accurate when there are more singular vectors requested, proportional to the rank of the matrix. As n_components gets closer to (and larger than) to the rank of the matrix, the algorithm gets more accurate. Let's finally look at the extreme case and compare our rank 1 approximation. The task here is to only estimate a single singular pair.
L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
x = np.random.randn(m,n)
ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # get 1000 x 1000 and 500 x 500 orthogonal matrices
sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
pca_operator = TruncatedSVD(k)
pca_operator.fit(y)
sy = pca_operator.singular_values_ # our estimate of the singular values
y_transformed = pca_operator.transform(y)
y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
print(np.mean(L2_sv_error))
print(np.mean(F_lowrank_error))
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816
We can make the algorithm more accurate
It turns out that there are a lot of edge cases and examples where randomized svd will fail either because the matrix is too large, ill-conditioned, the rank is too high, etc. However, there are a few parameters that can be tweaked in the inner function of randomized svd, sklearn.utils.extmath.randomized_svd to make things more accurate. The biggest one is n_oversamples, and then n_iters
How to change graphtools
I propose that we replace all calls to PCA and Truncated SVD with explicit calls to randomized_svd and we set sensible n_oversamples as a factor of the requested n_pca. The default is not very good. The sklearn documentation suggests for a rank k matrix n_oversamples should be 2*k-n_components or just simply n_components when n_components >= k, but I have found for hard problems this is not enough. We can also add an svd_kwargs keyword argument to the graph constructors to allow passing through kwargs to randomized SVD to increase accuracy or trade accuracy for performance.
@scottgigante @dburkhardt
Hi,
Randomized SVD is not accurate
Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is first routed through either
TruncatedSVDorPCA(svd_solver='randomized'). It turns out that this solver can be pretty bad at computing even moderate rank SVDs. Consider this pathological example in which we create a 1000 x 500 matrix withnp.hstack([np.zeros(249,),np.arange(250,501)])as its spectrum. The matrix is rank 250. We will also consider its rank-50 reconstruction and its rank 1 approximation.It is clear that there is a problem
It turns out that we can increase
kand our estimate gets betterWe can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that
randomized_svdgets more accurate when there are more singular vectors requested, proportional to the rank of the matrix. Asn_componentsgets closer to (and larger than) to the rank of the matrix, the algorithm gets more accurate. Let's finally look at the extreme case and compare our rank 1 approximation. The task here is to only estimate a single singular pair.We can make the algorithm more accurate
It turns out that there are a lot of edge cases and examples where randomized svd will fail either because the matrix is too large, ill-conditioned, the rank is too high, etc. However, there are a few parameters that can be tweaked in the inner function of randomized svd,
sklearn.utils.extmath.randomized_svdto make things more accurate. The biggest one isn_oversamples, and thenn_itersHow to change graphtools
I propose that we replace all calls to PCA and Truncated SVD with explicit calls to
randomized_svdand we set sensiblen_oversamplesas a factor of the requestedn_pca. The default is not very good. The sklearn documentation suggests for a rankkmatrixn_oversamplesshould be2*k-n_componentsor just simplyn_componentswhenn_components >= k, but I have found for hard problems this is not enough. We can also add ansvd_kwargskeyword argument to the graph constructors to allow passing through kwargs to randomized SVD to increase accuracy or trade accuracy for performance.@scottgigante @dburkhardt