TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
from bertopic import BERTopic
docs = ["Hi i'm a doc", "i'm also a doc", "I'm a document", "this is an apple", "yet another topic"]
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
Running BERTopic model…
C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1593: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  warnings.warn("k >= N for N * N square matrix. "
TypeError                                 Traceback (most recent call last)
<ipython-input-28-7dcfeabe3647> in <module>
      1 print('Running BERTopic model…')
      2 # topics, _ = tqdm(topic_model.fit_transform(docs))
----> 3 topics, _ = tqdm(topic_model.fit_transform(short_docs))
C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\bertopic\_bertopic.py in fit_transform(self, documents, embeddings)
    279
    280         # Reduce dimensionality with UMAP
--> 281         umap_embeddings = self._reduce_dimensionality(embeddings)
    282
    283         # Cluster UMAP embeddings with HDBSCAN

C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\bertopic\_bertopic.py in _reduce_dimensionality(self, embeddings)
   1101                                    low_memory=self.low_memory).fit(embeddings)
   1102         else:
-> 1103             self.umap_model.fit(embeddings)
   1104             umap_embeddings = self.umap_model.transform(embeddings)
   1105             logger.info("Reduced dimensionality with UMAP")
C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\umap\umap_.py in fit(self, X, y)
   2551
   2552         if self.transform_mode == "embedding":
-> 2553             self.embedding_, aux_data = self._fit_embed_data(
   2554                 self._raw_data[index], n_epochs, init, random_state,  # JH why raw data?
   2555             )

C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\umap\umap_.py in _fit_embed_data(self, X, n_epochs, init, random_state)
   2578         replaced by subclasses.
   2579         """
-> 2580         return simplicial_set_embedding(
   2581             X,
   2582             self.graph,

C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\umap\umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
   1052     elif isinstance(init, str) and init == "spectral":
   1053         # We add a little noise to avoid local minima for optimization to come
-> 1054         initialisation = spectral_layout(
   1055             data,
   1056             graph,
C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\umap\spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    325     try:
    326         if L.shape[0] < 2000000:
--> 327             eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    328                 L,
    329                 k,

C:\ProgramData\Anaconda3\envs\my_project\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1596
   1597         if issparse(A):
-> 1598             raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1599                             "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1600                             " reduce k.")
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
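The last frame shows that `scipy.sparse.linalg.eigsh` rejects a sparse matrix whenever the number of requested eigenvectors `k` is at least the matrix size `N`. The following is a minimal sketch that reproduces this outside BERTopic and UMAP. The helper `try_eigsh` is hypothetical, the tridiagonal matrix is only a stand-in for the graph Laplacian that UMAP builds, and the sizes (5 samples, `k=6`) are assumptions chosen to mimic a five-document corpus, not values taken from the traceback.

```python
import warnings

import numpy as np
import scipy.sparse
import scipy.sparse.linalg


def try_eigsh(n_samples: int, k: int) -> str:
    """Call eigsh on a small sparse symmetric matrix and report the outcome."""
    # Tridiagonal (path-graph style) matrix: an illustrative stand-in only,
    # not the Laplacian UMAP actually constructs.
    main = np.full(n_samples, 2.0)
    off = np.full(n_samples - 1, -1.0)
    matrix = scipy.sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

    with warnings.catch_warnings():
        # Silence the "k >= N" RuntimeWarning that precedes the error.
        warnings.simplefilter("ignore", RuntimeWarning)
        try:
            scipy.sparse.linalg.eigsh(matrix, k=k)
        except TypeError as exc:
            return f"TypeError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(try_eigsh(n_samples=50, k=6))  # enough samples for k
    print(try_eigsh(n_samples=5, k=6))   # k >= N on a sparse matrix
```

If this matches what happens inside `spectral_layout`, the error depends only on how many documents reach UMAP relative to the number of eigenvectors requested, which would be consistent with a very small corpus triggering it.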
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Since duplicating the documents helped, you probably have too few documents for UMAP to create lower-dimensional representations. You could use the duplication trick to train your model and then divide all frequencies by 2 to get the correct values.
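A rough sketch of that duplication trick, for illustration only. The helper `fit_with_duplication` and its `factor` parameter are hypothetical names, not part of BERTopic; the only assumption about the model is that `fit_transform` returns a `(topics, probabilities)` pair, as in the snippet at the top of the issue.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Sequence


def fit_with_duplication(
    fit_transform: Callable[[list[str]], tuple[Sequence[int], object]],
    docs: Sequence[str],
    factor: int = 2,
) -> tuple[list[int], dict[int, float]]:
    """Fit on `factor` copies of `docs`, then undo the inflation in the counts.

    Returns the topics assigned to the first copy of the documents and the
    per-topic frequencies divided by `factor`.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")

    # Repeat the whole corpus so the model sees factor * len(docs) documents.
    repeated = list(docs) * factor
    topics, _ = fit_transform(repeated)
    topics = list(topics)

    # Every document was counted `factor` times, so scale the counts back.
    counts = Counter(topics)
    frequencies = {topic: count / factor for topic, count in counts.items()}

    # The first len(docs) entries correspond to the original documents.
    return topics[: len(docs)], frequencies
```

With BERTopic this would presumably be called as `fit_with_duplication(BERTopic().fit_transform, docs)`. Nothing here guarantees that every copy of a document lands in the same topic, so the scaled frequencies can come out fractional.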
Hey Maarten, to clarify a misunderstanding: Duplicating the documents did actually help.
I’m using: