python numpy成对编辑距离 [英] python numpy pairwise edit-distance

查看:168
本文介绍了python numpy成对编辑距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,我有一个字符串数组,我想使用以下函数计算每对元素之间的成对编辑距离:scipy.spatial.distance.pdist from

So, I have a numpy array of strings, and I want to calculate the pairwise edit-distance between each pair of elements using this function: scipy.spatial.distance.pdist from http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html

我的数组的示例如下:

 >>> d[0:10]
 array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
   'GATTT', 'TCTTT', 'ACTTT'], 
  dtype='|S5')

但是,由于它没有'editdistance'选项,因此,我想提供一个自定义的距离函数.我尝试了此操作,但遇到了以下错误:

However, since it doesn't have the 'editdistance' option, therefore, I want to give a customized distance function. I tried this and I faced the following error:

 >>> import editdist
 >>> import scipy
 >>> import scipy.spatial
 >>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT

推荐答案

如果确实必须使用pdist,则首先需要将字符串转换为数字格式.如果您知道所有字符串的长度都相同,则可以很容易地做到这一点:

If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:

numeric_d = d.view(np.uint8).reshape((len(d),-1))

这只是将您的字符串数组视为一个uint8个字节的长数组,然后对其进行重塑,以使每个原始字符串单独位于一行上.在您的示例中,这看起来像:

This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:

In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)

然后,您可以像平常一样使用pdist.只要确保您的editdist函数期望的是整数数组,而不是字符串.您可以通过调用.tostring():

Then, you can use pdist as you normally would. Just make sure that your editdist function is expecting arrays of integers, rather than strings. You could quickly convert your new inputs by calling .tostring():

def editdist(x, y):
  s1 = x.tostring()
  s2 = y.tostring()
  ... rest of function as before ...

这篇关于python numpy成对编辑距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆