DBSCAN clustering of string data


So, I want to use the DBSCAN clustering algorithm from scikit-learn on a dataset of character strings. I used metric=Levenshtein.distance and my data is given in a NumPy array. But why does it want to convert the strings to float?

I am using the following packages:

CPython 2.7.3
IPython 2.0.0

numpy 1.9.0
scipy 0.14.0
scikit-learn 0.15.2
matplotlib 1.4.0
pyprind 2.5.0

and my data shape and DBSCAN object are defined as below:

d_arr.shape
(30079, 1)

dbs = sklearn.cluster.DBSCAN(eps=4, min_samples=2, metric=Levenshtein.distance)
dbs.fit(d_arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-a7573a2d5bba> in <module>()
----> 1 dbs.fit(d_arr)

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in fit(self, X)
    246             Overwrite keywords from __init__.
    247         """
--> 248         clust = dbscan(X, **self.get_params())
    249         self.core_sample_indices_, self.labels_ = clust
    250         self.components_ = X[self.core_sample_indices_].copy()

/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.pyc in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, random_state)
     99                                            leaf_size=leaf_size,
    100                                            metric=metric, p=p)
--> 101         neighbors_model.fit(X)
    102 
    103     # Calculate neighborhood for all samples. This leaves the original point

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
    659             Training data. If array or matrix, shape = [n_samples, n_features]
    660         """
--> 661         return self._fit(X)

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.pyc in _fit(self, X)
    232             self._tree = BallTree(X, self.leaf_size,
    233                                   metric=self.effective_metric_,
--> 234                                   **self.effective_metric_params_)
    235         elif self._fit_method == 'kd_tree':
    236             self._tree = KDTree(X, self.leaf_size,

/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:7983)()

/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

ValueError: could not convert string to float: TACGTAGGGGGCAA
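
The traceback itself explains the float conversion: with a callable metric, this scikit-learn version builds a BallTree, and BallTree coerces its input to a float array, which fails on raw strings. Below is a hedged sketch (not from the original report) of the precomputed-distance route that the comments further down contrast against; the sequences list is a made-up stand-in for the contents of d_arr, and the full matrix costs O(N*N) memory.

# Hedged sketch, not code from the original report: precompute the Levenshtein
# distance matrix so DBSCAN never has to cast the strings to float.
import numpy as np
import Levenshtein
from sklearn.cluster import DBSCAN

# Made-up stand-in for the contents of d_arr.
sequences = ["TACGTAGGGGGCAA", "TACGTAGGGGGCAT", "GGGCAATACGTAGG"]

# Build the full N x N distance matrix (O(N*N) memory).
n = len(sequences)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = Levenshtein.distance(sequences[i], sequences[j])
        dist[i, j] = dist[j, i] = d

# metric="precomputed" tells DBSCAN to treat `dist` as pairwise distances.
db = DBSCAN(eps=4, min_samples=2, metric="precomputed").fit(dist)
print(db.labels_)

For the 30079-sequence array in the question, that matrix alone is roughly 7 GB of float64, which is why the comments below prefer a callable metric.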

Issue Analytics

  • State: open
  • Created: 9 years ago
  • Comments: 17 (10 by maintainers)

Top GitHub Comments

4 reactions
argriffing commented, Oct 13, 2014

That is a much more general workaround, putting the lookup information into the metric instead of encoding it into the points! If you want to do this more literally, you can write something like:

from functools import partial
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from pylev import levenshtein

strings = ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']

def mylev(s, u, v):
    # u and v are single-element rows of the index array; look up the strings
    return levenshtein(s[int(u)], s[int(v)])

print pairwise_distances(
        np.arange(len(strings)).reshape(-1, 1),
        metric=partial(mylev, strings))
[[ 0.  8.  9.  8.  8.  9.]
 [ 8.  0.  7.  6.  6.  9.]
 [ 9.  7.  0.  8.  8.  7.]
 [ 8.  6.  8.  0.  1.  9.]
 [ 8.  6.  8.  1.  0.  9.]
 [ 9.  9.  7.  9.  9.  0.]]

Another thing you can do with a callable metric instead of a precomputed distance matrix is save memory: for example, you might want the clustering to run in O(N*N) time but only O(N) memory.

3 reactions
larsmans commented, Oct 13, 2014

To exploit the triangle inequality, you need to use d(A, B) + d(B, C) as a bound on d(A, C) in a bounded version of the edit distance DP algorithm. It needs to remember these distances, so it builds the distance matrix lazily. Is that what you mean?
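
As a tiny sanity check of that bound (my own illustration, not from the comment; the strings are arbitrary), the triangle inequality does make d(A, B) + d(B, C) an upper bound on d(A, C) for edit distance:

# Hedged illustration with arbitrary example strings.
from leven import levenshtein

A, B, C = "kitten", "sitting", "sitten"
bound = levenshtein(A, B) + levenshtein(B, C)
assert levenshtein(A, C) <= bound  # triangle inequality for edit distance
print("%d <= %d" % (levenshtein(A, C), bound))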

Interestingly, when thinking about how to implement that, I figured out a different way to solve the original problem. Just let the samples be vectors containing an index into a table of strings, and have the metric do the lookup.

>>> from numpy import arange
>>> from leven import levenshtein
>>> from sklearn.cluster import dbscan
>>> strings = ["foo", "bar", "foobar"]
>>> def lev_metric(x, y):
...     i, j = int(x[0]), int(y[0])  # extract indices
...     return levenshtein(strings[i], strings[j])
... 
>>> dbscan(arange(len(strings)).reshape(-1, 1), metric=lev_metric)
([], array([-1, -1, -1]))
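
A hedged follow-up to that snippet (mine, not from the comment): with the default eps and min_samples everything comes out as noise, as the output above shows, so the parameters have to be chosen on the edit-distance scale. The values below are only illustrative, and the labels are printed rather than asserted.

# Hedged follow-up sketch: same index-as-sample trick, with eps/min_samples
# chosen on the edit-distance scale, and labels mapped back onto the strings.
import numpy as np
from leven import levenshtein
from sklearn.cluster import dbscan

strings = ["foo", "bar", "foobar"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # samples are indices into `strings`
    return levenshtein(strings[i], strings[j])

X = np.arange(len(strings)).reshape(-1, 1)
core_samples, labels = dbscan(X, eps=3, min_samples=2, metric=lev_metric)

for s, label in zip(strings, labels):
    print("%s -> %d" % (s, label))  # label -1 marks noise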