python numpy pairwise edit-distance
Question
I have a numpy array of strings, and I want to calculate the edit distance between each pair of elements using scipy.spatial.distance.pdist, documented at http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html
An example of my array looks like this:
>>> d[0:10]
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
       'GATTT', 'TCTTT', 'ACTTT'],
      dtype='|S5')
However, pdist has no 'editdistance' option, so I want to supply a custom distance function. I tried this and got the following error:
>>> import editdist
>>> import scipy
>>> import scipy.spatial
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT
Recommended answer
If you really must use pdist, you first need to convert your strings to a numeric format. If you know that all strings have the same length, you can do this rather easily:
numeric_d = d.view(np.uint8).reshape((len(d),-1))
This simply views your array of strings as one long array of uint8 bytes, then reshapes it so that each original string sits on a row by itself. In your example, this would look like:
In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
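The view trick relies on the array holding fixed-width byte strings (dtype 'S5', as in the question). The sketch below repeats the conversion on a shortened copy of the example data and also covers two cases the answer does not mention: a unicode array, which stores four bytes per character and is assumed here to contain ASCII only, and the reverse conversion back to strings. The variable names numeric_u and restored are illustrative.

```python
import numpy as np

# Byte-string array, as in the question (dtype 'S5').
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'GTTTT'])
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# A unicode array ('<U5') uses 4 bytes per character, so cast it to
# bytes first. This assumes the strings contain ASCII characters only.
u = np.array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT'])
numeric_u = u.astype('S').view(np.uint8).reshape((len(u), -1))

# Reverse the conversion: view each row of bytes as one fixed-width string.
restored = numeric_d.view('S%d' % numeric_d.shape[1]).ravel()
```

Both numeric arrays have one row per string and one column per character, which is the two-dimensional numeric layout pdist expects.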
Then you can use pdist as you normally would. Just make sure that your editdist function expects arrays of numbers rather than strings. You can quickly convert the new inputs back to byte strings by calling .tostring() (named .tobytes() in current NumPy releases):
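def editdist(x, y):
    # Some SciPy versions hand the rows over as floats, so cast back to
    # uint8 first; .tobytes() is the current name for .tostring().
    s1 = x.astype(np.uint8).tobytes()
    s2 = y.astype(np.uint8).tobytes()
    ... rest of function as before ...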
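Putting the pieces together, here is a minimal end-to-end sketch. It does not use the editdist package from the question; instead it assumes a plain dynamic-programming Levenshtein function (levenshtein and row_edit_distance are illustrative names, not part of SciPy or editdist), so it runs with only NumPy and SciPy installed.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def levenshtein(s1, s2):
    """Plain dynamic-programming edit distance between two sequences."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]

def row_edit_distance(x, y):
    # Rows may arrive as floats depending on the SciPy version,
    # so cast back to uint8 before rebuilding the byte strings.
    s1 = np.asarray(x).astype(np.uint8).tobytes()
    s2 = np.asarray(y).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)

d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'GTTTT', b'TATTT'])
numeric_d = d.view(np.uint8).reshape((len(d), -1))

condensed = pdist(numeric_d, row_edit_distance)  # condensed distance vector
square = squareform(condensed)                   # full symmetric matrix
```

pdist returns the distances in condensed form, one entry per unordered pair; squareform expands that into the full matrix, so square[i, j] is the edit distance between d[i] and d[j]. A pure-Python metric is called once per pair, so this will be slow for large arrays.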