DBSCAN Clustering of string data
So, I want to use the DBSCAN clustering algorithm from scikit-learn on a dataset of character strings. I used metric=Levenshtein.distance, and my data is given in a NumPy array. But why does it try to convert the strings to float?
I am using the following packages:
CPython 2.7.3
IPython 2.0.0
numpy 1.9.0
scipy 0.14.0
scikit-learn 0.15.2
matplotlib 1.4.0
pyprind 2.5.0
and my data shape and DBSCAN object are defined as below:
d_arr.shape
(30079, 1)
dbs = sklearn.cluster.DBSCAN(eps=4, min_samples=2, metric=Levenshtein.distance)
dbs.fit(d_arr)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-a7573a2d5bba> in <module>()
----> 1 dbs.fit(d_arr)
/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in fit(self, X)
246 Overwrite keywords from __init__.
247 """
--> 248 clust = dbscan(X, **self.get_params())
249 self.core_sample_indices_, self.labels_ = clust
250 self.components_ = X[self.core_sample_indices_].copy()
/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, random_state)
99 leaf_size=leaf_size,
100 metric=metric, p=p)
--> 101 neighbors_model.fit(X)
102
103 # Calculate neighborhood for all samples. This leaves the original point
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
659 Training data. If array or matrix, shape = [n_samples, n_features]
660 """
--> 661 return self._fit(X)
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
232 self._tree = BallTree(X, self.leaf_size,
233 metric=self.effective_metric_,
--> 234 **self.effective_metric_params_)
235 elif self._fit_method == 'kd_tree':
236 self._tree = KDTree(X, self.leaf_size,
/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()
/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: could not convert string to float: TACGTAGGGGGCAA
That is a much more general workaround: putting the lookup information into the metric instead of encoding it into the points! If you want to do this more literally, you can write something like the following.
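A rough sketch of what that could look like (the exact snippet from the thread is not preserved here, so the toy strings, the functools.partial wrapping, and the explicit algorithm='brute' choice are illustrative assumptions):

from functools import partial
import numpy as np
import Levenshtein  # python-Levenshtein, as in the original report
from sklearn.cluster import DBSCAN

def indexed_levenshtein(x, y, table):
    # x and y are 1-element index vectors; the lookup table lives in the metric.
    return Levenshtein.distance(table[int(x[0])], table[int(y[0])])

strings = ["TACGTAGGGGGCAA", "TACGTAGGGCCCAA", "GGGGACTTTAAGTC"]  # toy data
X = np.arange(len(strings)).reshape(-1, 1)  # numeric samples: one index each
metric = partial(indexed_levenshtein, table=strings)  # bind the table to the metric

dbs = DBSCAN(eps=4, min_samples=2, metric=metric, algorithm='brute')
dbs.fit(X)
print(dbs.labels_)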
Another thing you can do with a callable metric instead of a precomputed distance matrix is save memory. For example, you might want the run to take O(N*N) time but only O(N) memory.
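For comparison, the precomputed-matrix route looks roughly like this (a sketch with toy data); it stores all N*N pairwise distances up front, which is exactly the memory cost a callable metric avoids:

import numpy as np
import Levenshtein
from sklearn.cluster import DBSCAN

strings = ["TACGTAGGGGGCAA", "TACGTAGGGCCCAA", "GGGGACTTTAAGTC"]  # toy data
n = len(strings)
D = np.zeros((n, n))  # full N x N distance matrix: O(N*N) memory
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = Levenshtein.distance(strings[i], strings[j])

dbs = DBSCAN(eps=4, min_samples=2, metric='precomputed').fit(D)
print(dbs.labels_)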
To exploit the triangle inequality, you need to use d(A, B) + d(B, C) as a bound on d(A, C) in a bounded version of the edit-distance DP algorithm. That requires remembering the distances already computed, effectively building the distance matrix lazily. Is that what you mean?
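A minimal sketch of such a bounded edit-distance DP, assuming the bound comes from previously computed distances (e.g. d(A, B) + d(B, C)) or simply from eps; the function name and the early-exit rule are illustrative, not code from the thread:

def bounded_levenshtein(a, b, bound):
    # Return the edit distance if it is <= bound, otherwise bound + 1.
    if abs(len(a) - len(b)) > bound:   # length difference already exceeds the bound
        return bound + 1
    prev = list(range(len(b) + 1))     # DP row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match / substitution
        if min(cur) > bound:           # every alignment through this row exceeds the bound
            return bound + 1
        prev = cur
    return prev[-1]

Used as a DBSCAN metric, any value above eps is treated the same, so returning bound + 1 once the bound is exceeded loses nothing.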
Interestingly, while thinking about how to implement that, I figured out a different way to solve the original problem: just let the samples be vectors containing an index into a table, and have the metric do the lookup.
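Applied to the original problem, that could look like the sketch below; it assumes each row of the (30079, 1) array holds one sequence string (a toy stand-in is used here since the real data isn't available):

import numpy as np
import Levenshtein
from sklearn.cluster import DBSCAN

# Stand-in for the real (30079, 1) array of sequences from the question.
d_arr = np.array([["TACGTAGGGGGCAA"], ["TACGTAGGGCCCAA"], ["GGGGACTTTAAGTC"]], dtype=object)

table = [row[0] for row in d_arr]            # the lookup table of strings
X = np.arange(len(table)).reshape(-1, 1)     # samples are just indices into the table

def lev_metric(x, y):
    # The metric does the lookup and compares the underlying strings.
    return Levenshtein.distance(table[int(x[0])], table[int(y[0])])

dbs = DBSCAN(eps=4, min_samples=2, metric=lev_metric, algorithm='brute')
dbs.fit(X)
print(dbs.labels_)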