
IPCA did not converge, numpy.linalg.LinAlgError: SVD did not converge #15996

Closed · madanvnera opened this issue Apr 16, 2020 · 8 comments


madanvnera commented Apr 16, 2020

Incremental PCA consistently fails to converge on a DataFrame of shape 18000 × 18000.

Reproducing code example:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, IncrementalPCA

df_data = pd.read_csv("/home/ubuntu/df_data_18000_18000_data1.csv")
df_data.set_index('Unnamed: 0', inplace=True)
df_data = df_data.astype('int8')
ipca = IncrementalPCA(n_components=3600, batch_size=3600)
data_ipca = ipca.fit_transform(df_data)
total_explained_variance_ratio = ipca.explained_variance_ratio_.sum()
print("Total explained variance in IPCA is {}".format(total_explained_variance_ratio))
df = pd.DataFrame(data_ipca, index=list(df_data.index))
print("Size of vector space after IncrementalPCA {}".format(df.shape))
[df_data_18000_18000_data2.csv.zip](https://github.com/numpy/numpy/files/4486707/df_data_18000_18000_data2.csv.zip)
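Since the failing step is the SVD inside `partial_fit`, a quick sanity check worth running first is whether the input contains non-finite values, which are a common trigger for "SVD did not converge". A minimal sketch, using random data as a stand-in for the attached CSV:

```python
import numpy as np
import pandas as pd

# Stand-in for the attached df_data CSV; any DataFrame works here.
df_check = pd.DataFrame(np.random.rand(100, 20))

# A NaN or inf anywhere in the matrix can make LAPACK's SVD fail.
finite = np.isfinite(df_check.to_numpy()).all()
print("all values finite:", finite)
```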

Error message:

Traceback (most recent call last):
  File "ipca_script.py", line 8, in <module>
    data_ipca = ipca.fit_transform(df_data)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/sklearn/base.py", line 553, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/sklearn/decomposition/incremental_pca.py", line 201, in fit
    self.partial_fit(X[batch], check_input=False)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/sklearn/decomposition/incremental_pca.py", line 279, in partial_fit
    U, S, V = linalg.svd(X, full_matrices=False)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/scipy/linalg/decomp_svd.py", line 132, in svd
    ...
numpy.linalg.LinAlgError: SVD did not converge

Numpy/Python version information:

numpy 1.17.4
Python 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]

print(sklearn.__version__)
0.21.2


charris commented Apr 16, 2020

What problem are you trying to solve with such large matrices?


madanvnera commented Apr 16, 2020

Yes, I understand it is a large matrix. In our problem the number of features is dynamic and can be very large in some cases (as discussed above). We need to find the principal features, and we are okay with a lower explained-variance ratio here. We are using Incremental PCA for memory optimization. Do you think IPCA is not suitable for a large number of features? I do not understand why it gives a convergence issue; it should still be able to produce a component set with a lower variance ratio. IPCA does not have an option for a target variance ratio. A LAPACK fallback should be triggered if it does not converge.
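If memory is the only reason for choosing IncrementalPCA, one alternative worth trying (my suggestion, not something the thread settled on) is PCA with `svd_solver='randomized'`, which estimates only the leading components instead of computing a full SVD. A rough sketch with toy data standing in for the real matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 50)  # toy stand-in for the 18000 x 18000 matrix

# Randomized SVD only estimates the top n_components, which is much
# cheaper than a full decomposition when n_components << n_features.
pca = PCA(n_components=10, svd_solver='randomized', random_state=0)
Xt = pca.fit_transform(X)
print(Xt.shape)  # (200, 10)
```

The whole matrix still has to fit in memory, unlike with IncrementalPCA, so this only helps if the batching was purely a workaround for the solver.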


seberg commented Apr 16, 2020

We do have some 64-bit BLAS support now; I am not quite sure when it is active by default (i.e., in our wheels, Anaconda?). But you may get away with this if you update to a newer NumPy version. As a start, if you want to look into it: gh-15012 and gh-15114.
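To see which BLAS/LAPACK a given NumPy build is linked against (relevant to whether the 64-bit support mentioned above is active), a quick check is:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled
# against; ILP64 (64-bit integer) builds are what gh-15012 and
# gh-15114 are about.
np.show_config()
print(np.__version__)
```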


charris commented Apr 16, 2020

One thing worth looking at is whether the approach you are using can be simplified, that is, whether the algorithm can be improved. The large array is somewhat suspicious in that regard; that is why I was asking for more details on what you are doing.

@madanvnera

Yes, the original matrix really is this large. We need to run clustering on this data, and we apply principal component analysis before feeding it to the clustering algorithm.
@charris any idea what can cause the convergence issue here?

Also, are you planning to add more options for the SVD solver in Incremental PCA, like PCA's {'auto', 'full', 'arpack', 'randomized'}?


charris commented Apr 16, 2020

Incremental PCA is a scikit-learn thing, not numpy. My naive thought is that incremental may not be the best approach here. What I am curious about is how the matrix is produced.

@madanvnera

Thanks for the reply. Sorry, I just realized I am on the NumPy GitHub; I should be asking this on scikit-learn.
Here I am more interested in knowing when and why numpy.linalg.LinAlgError can occur in the NumPy module. I see it intermittently in plain PCA as well. Thanks in advance.
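As a general answer to that question: numpy.linalg.LinAlgError is raised whenever an underlying LAPACK routine reports failure. For SVD, one easily reproducible trigger (and, in my experience, a common one in practice) is non-finite input. A minimal sketch:

```python
import numpy as np

# A NaN in the matrix makes the LAPACK SVD routine report failure,
# which NumPy surfaces as LinAlgError("SVD did not converge").
a = np.array([[1.0, np.nan],
              [0.0, 1.0]])
try:
    np.linalg.svd(a)
except np.linalg.LinAlgError as e:
    print("caught:", e)
```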


rossbar commented Jul 12, 2020

Closing as it seems the conversation has moved to another forum. Feel free to reopen if I've missed something.

@rossbar rossbar closed this as completed Jul 12, 2020