How to use string kernels in scikit-learn?


Question

I am trying to generate a string kernel that feeds a support vector classifier. I tried it with a function that calculates the kernel, something like this:

import numpy as np

def stringkernel(K, G):
    # editdistance is an external edit-distance function over two strings
    R = np.zeros((len(K), len(G)))  # Gram matrix; was never initialized in the original
    for a in range(len(K)):
        for b in range(len(G)):
            # np.exp replaces the deprecated scipy.exp alias
            R[a][b] = np.exp(editdistance(K[a], G[b]) ** 2)
    return R

And when I pass it to SVC as a parameter I get

clf = svm.SVC(kernel=stringkernel)
clf.fit(data, target)

ValueError: could not convert string to float: photography

where my data is a list of strings and the target is the corresponding class each string belongs to. I have reviewed some questions on Stack Overflow regarding this issue, but I think a bag-of-words representation is not appropriate for this case.

Solution

This is a limitation in scikit-learn that has proved hard to get rid of: SVC validates its input as a numeric array even when the kernel is a callable, so it cannot accept a list of strings directly. You can try this workaround. Represent the strings in feature vectors with only one feature, which is really just an index into the table of strings.

>>> data = ["foo", "bar", "baz"]
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])

Redefine the string kernel function to work on this representation:

>>> def string_kernel(X, Y):
...     R = np.zeros((len(X), len(Y)))
...     for a, x in enumerate(X):
...         for b, y in enumerate(Y):
...             i = int(x[0])  # index into the global string table
...             j = int(y[0])
...             # simplest kernel ever: compare first characters
...             R[a, b] = data[i][0] == data[j][0]
...     return R
... 
>>> clf = SVC(kernel=string_kernel)
>>> clf.fit(X, ['no', 'yes', 'yes'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel=<function string_kernel at 0x7f5988f0bde8>, max_iter=-1,
  probability=False, random_state=None, shrinking=True, tol=0.001,
  verbose=False)

The downside to this is that to classify new samples, you have to add them to data, then construct new pseudo-feature vectors for them.

>>> data.extend(["bla", "fool"])
>>> clf.predict([[3], [4]])
array(['yes', 'no'], 
      dtype='|S3')

(You can get around this by doing more interpretation of your pseudo-features, e.g., looking into a different table for i >= len(X_train). But it's still cumbersome.)
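For instance, a minimal sketch of that bookkeeping, with hypothetical train_strings and test_strings tables standing in for the two lookups:

train_strings = ["foo", "bar", "baz"]  # frozen at fit time
test_strings = []                      # appended to before predict

def lookup(i):
    # Hypothetical helper: pseudo-feature indices past the end of the
    # training table refer to the test table instead.
    if i < len(train_strings):
        return train_strings[i]
    return test_strings[i - len(train_strings)]

The kernel would then call lookup(i) instead of data[i], so the training table never has to change after fit.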

This is an ugly hack, but it works (it's slightly less ugly for clustering because there the dataset doesn't change after fit). Speaking on behalf of the scikit-learn developers, I say a patch to fix this properly is welcome.
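Finally, to connect this back to the question: the same indexing trick works for an edit-distance kernel. A minimal sketch, assuming the global data list holds every string; the levenshtein helper is only an illustration (any edit-distance function would do), and the exponent's sign is flipped to exp(-d**2) so that similarity decays with distance, unlike the exp(+d**2) in the question's code:

import numpy as np

def levenshtein(s, t):
    # Classic dynamic-programming edit distance (hypothetical helper).
    m, n = len(s), len(t)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return int(d[m, n])

def edit_distance_kernel(X, Y):
    # X and Y contain indices into the global `data` string table.
    R = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            d = levenshtein(data[int(x[0])], data[int(y[0])])
            R[a, b] = np.exp(-float(d) ** 2)  # similarity decays with distance
    return R

Note that edit distance does not in general yield a positive semidefinite Gram matrix, so SVC is not guaranteed to behave well with such a kernel.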
