DBSCAN clustering of string data


So, I want to use the DBSCAN clustering algorithm from scikit-learn on a dataset of character strings. I used metric=Levenshtein.distance and my data is given in a NumPy array. But why does it want to convert the strings to float?

I am using the following packages:

CPython 2.7.3
IPython 2.0.0

numpy 1.9.0
scipy 0.14.0
scikit-learn 0.15.2
matplotlib 1.4.0
pyprind 2.5.0

and my data shape and DBSCAN object are defined as below:

d_arr.shape
(30079, 1)

dbs = sklearn.cluster.DBSCAN(eps=4, min_samples=2, metric=Levenshtein.distance)
dbs.fit(d_arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-a7573a2d5bba> in <module>()
----> 1 dbs.fit(d_arr)

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in fit(self, X)
    246             Overwrite keywords from __init__.
    247         """
--> 248         clust = dbscan(X, **self.get_params())
    249         self.core_sample_indices_, self.labels_ = clust
    250         self.components_ = X[self.core_sample_indices_].copy()

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, random_state)
     99                                            leaf_size=leaf_size,
    100                                            metric=metric, p=p)
--> 101         neighbors_model.fit(X)
    102 
    103     # Calculate neighborhood for all samples. This leaves the original point

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
    659             Training data. If array or matrix, shape = [n_samples, n_features]
    660         """
--> 661         return self._fit(X)

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
    232             self._tree = BallTree(X, self.leaf_size,
    233                                   metric=self.effective_metric_,
--> 234                                   **self.effective_metric_params_)
    235         elif self._fit_method == 'kd_tree':
    236             self._tree = KDTree(X, self.leaf_size,

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()

/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

ValueError: could not convert string to float: TACGTAGGGGGCAA
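
The traceback itself explains the float conversion: with a callable metric, this scikit-learn version builds a BallTree, and BallTree coerces its input to a float array, which fails on raw strings. Below is a hedged sketch (not from the original report) of the precomputed-distance route that the comments further down contrast against; the sequences list is a made-up stand-in for the contents of d_arr, and the full matrix costs O(N*N) memory.

# Hedged sketch, not code from the original report: precompute the Levenshtein
# distance matrix so DBSCAN never has to cast the strings to float.
import numpy as np
import Levenshtein
from sklearn.cluster import DBSCAN

# Made-up stand-in for the contents of d_arr.
sequences = ["TACGTAGGGGGCAA", "TACGTAGGGGGCAT", "GGGCAATACGTAGG"]

# Build the full N x N distance matrix (O(N*N) memory).
n = len(sequences)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = Levenshtein.distance(sequences[i], sequences[j])
        dist[i, j] = dist[j, i] = d

# metric="precomputed" tells DBSCAN to treat `dist` as pairwise distances.
db = DBSCAN(eps=4, min_samples=2, metric="precomputed").fit(dist)
print(db.labels_)

For the 30079-sequence array in the question, that matrix alone is roughly 7 GB of float64, which is why the comments below prefer a callable metric.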

Issue Analytics

  • State: open
  • Created: 9 years ago
  • Comments: 17 (10 by maintainers)

Top GitHub Comments

4 reactions
argriffing commented, Oct 13, 2014

That is a much more general workaround, putting the lookup information into the metric instead of encoding it into the points! If you want to do this more literally, you can write something like:

from functools import partial
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from pylev import levenshtein

strings = ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']

def mylev(s, u, v):
    # u and v are single-element rows of the index array; look up the strings
    return levenshtein(s[int(u)], s[int(v)])

print pairwise_distances(
        np.arange(len(strings)).reshape(-1, 1),
        metric=partial(mylev, strings))
[[ 0.  8.  9.  8.  8.  9.]
 [ 8.  0.  7.  6.  6.  9.]
 [ 9.  7.  0.  8.  8.  7.]
 [ 8.  6.  8.  0.  1.  9.]
 [ 8.  6.  8.  1.  0.  9.]
 [ 9.  9.  7.  9.  9.  0.]]

Another thing you can do with a callable metric instead of a precomputed distance matrix is save memory: for example, you might want the clustering to run in O(N*N) time but only O(N) memory.

3 reactions
larsmans commented, Oct 13, 2014

To exploit the triangle inequality, you need to use d(A, B) + d(B, C) as a bound on d(A, C) in a bounded version of the edit distance DP algorithm. It needs to remember these distances, so it builds the distance matrix lazily. Is that what you mean?
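
As a tiny sanity check of that bound (my own illustration, not from the comment; the strings are arbitrary), the triangle inequality does make d(A, B) + d(B, C) an upper bound on d(A, C) for edit distance:

# Hedged illustration with arbitrary example strings.
from leven import levenshtein

A, B, C = "kitten", "sitting", "sitten"
bound = levenshtein(A, B) + levenshtein(B, C)
assert levenshtein(A, C) <= bound  # triangle inequality for edit distance
print("%d <= %d" % (levenshtein(A, C), bound))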

Interestingly, when thinking about how to implement that, I figured out a different way to solve the original problem. Just let the samples be vectors containing an index into a table of strings, and have the metric do the lookup.

>>> from numpy import arange
>>> from leven import levenshtein
>>> from sklearn.cluster import dbscan
>>> strings = ["foo", "bar", "foobar"]
>>> def lev_metric(x, y):
...     i, j = int(x[0]), int(y[0])  # extract indices
...     return levenshtein(strings[i], strings[j])
... 
>>> dbscan(arange(len(strings)).reshape(-1, 1), metric=lev_metric)
([], array([-1, -1, -1]))
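
A hedged follow-up to that snippet (mine, not from the comment): with the default eps and min_samples everything comes out as noise, as the output above shows, so the parameters have to be chosen on the edit-distance scale. The values below are only illustrative, and the labels are printed rather than asserted.

# Hedged follow-up sketch: same index-as-sample trick, with eps/min_samples
# chosen on the edit-distance scale, and labels mapped back onto the strings.
import numpy as np
from leven import levenshtein
from sklearn.cluster import dbscan

strings = ["foo", "bar", "foobar"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # samples are indices into `strings`
    return levenshtein(strings[i], strings[j])

X = np.arange(len(strings)).reshape(-1, 1)
core_samples, labels = dbscan(X, eps=3, min_samples=2, metric=lev_metric)

for s, label in zip(strings, labels):
    print("%s -> %d" % (s, label))  # label -1 marks noise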