Some doubts modelling some features for the libsvm/scikit-learn library in python


Problem description



I have scraped a lot of eBay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way

B M C S NA

where B=Brand (Apple), M=Model (iPhone 5), C=Color (White), S=Size (16GB), NA=Not Assigned (Dual-Core)

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in the eBay titles.

I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification one. In this way I can predict new models.

I want to represent these features to use them as input for the libsvm library. I work in Python :D.

  1. Identity of the current word

I think that I can interpret it in this way

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

If I know that the word is a Brand I will set that variable to 1 (true). It is OK to do it in the training set (because I have tagged all the words), but how can I do that for the test set? I don't know what the category of a word is (this is why I'm learning it :D).

  2. N-gram substring features of current word (N=4,5,6)

No idea, what does it mean?
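
One common reading of this feature, offered here as an assumption rather than something stated in the original question: character N-gram substring features are all length-N substrings of the word, each becoming one binary feature ("the word contains this substring"). A minimal sketch:

def char_ngrams(word, n):
    # All substrings of length n; empty list if the word is shorter than n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

char_ngrams('iPhone', 4)  # ['iPho', 'Phon', 'hone']
char_ngrams('iPhone', 6)  # ['iPhone']

Such features let the classifier generalize over word shapes, e.g. 'iPhone' and 'iPhones' share the 4-gram 'hone'.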

  3. Identity of 2 words before the current word.

How can I model this feature?

Considering the legend that I created for the 1st feature, I have 5^2 = 25 combinations:

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

How can I convert it to a format that the libsvm (or scikit-learn) can understand?
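
For illustration only (this is my own toy example, not part of the original question): once every token's features are binarized into a matrix, scikit-learn can write libsvm's sparse text format directly:

import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.array([[1, 0, 0, 1],   # one row per token, one column per binary feature
              [0, 1, 0, 1]])
y = np.array([0, 1])          # numeric tag ids, e.g. 0=Brand, 1=Model
dump_svmlight_file(X, y, 'train.libsvm', zero_based=False)
# train.libsvm now holds lines like "0 1:1 4:1", i.e. libsvm's
# "<label> <index>:<value>" sparse format with 1-based indices.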

  4. Membership to the 4 dictionaries of attributes

Again, how can I do it? Having 4 dictionaries (for color, size, model and brand), I think that I must create a bool variable that I will set to true if and only if the current word matches an entry in one of the 4 dictionaries.

  5. Exclusive membership to dictionary of brand names

I think that, as for feature 4, I must use a bool variable. Do you agree?

If this question lacks some info please read my previous question at this address: Support vector machine in Python using libsvm example of features

Last doubt: if I have a multi-token value like iPhone 5... must I tag iPhone as a brand and 5 also as a brand, or is it better to tag {iPhone 5} as a whole as a brand?

In the test dataset iPhone and 5 will be 2 separate words... so what is better to do?

Solution

The reason that the solution proposed to you in the previous question had insufficient results (I assume) is that the features were poor for this problem.

If I understand correctly, what you want is the following:

Given the sentence -

Apple iPhone 5 White 16GB Dual-Core

you want to get -

B M C S NA

The problem you are describing here is equivalent to part-of-speech (POS) tagging in Natural Language Processing.

Consider the following sentence in English:

We saw the yellow dog

The task of POS tagging is to give the appropriate tag to each word. In this case:

We(PRP) saw(VBD) the(DT) yellow(JJ) dog(NN)

Don't invest time in understanding the English tags here; I give this example only to show you that your problem and POS tagging are equivalent.

Before I explain how to solve it using an SVM, I want to make you aware of another approach: consider the sentence Apple iPhone 5 White 16GB Dual-Core as test data. The tag you set for the word Apple must be given as input to the tagger when you are tagging the word iPhone. However, after you tag a word, you will not change it. Hence, models that do sequence tagging usually achieve better results. The simplest example is the Hidden Markov Model (HMM). Here is a short intro to HMM for POS.
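
For completeness, a minimal sketch of the HMM route, using NLTK's supervised HMM trainer (my choice of library; the answer does not prescribe one). Training on just the single tagged title from the question:

from nltk.tag import hmm

train_data = [[('Apple', 'B'), ('iPhone', 'M'), ('5', 'NUMBER'),
               ('White', 'C'), ('16GB', 'S'), ('Dual-Core', 'NA')]]
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)
tagger.tag(['Apple', 'iPhone', '5'])
# -> [('Apple', 'B'), ('iPhone', 'M'), ('5', 'NUMBER')] on this toy corpus;
# real use needs many titles and smoothing for unseen words.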

Now we model this problem as a classification problem. Let's define what a window is:

`W-2,W-1,W0,W1,W2`

Here, we have a window of size 2. When classifying the word W0, we will need the features of all the words in the window (concatenated). Please note that for the first word of the sentence we will use:

`START-2,START-1,W0,W1,W2`

In order to model the fact that this is the first word. For the second word we have:

`START-1,W-1,W0,W1,W2`

And similarly for the words at the end of the sentence. The tags START-2, START-1, STOP1, STOP2 must be added to the model too.
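
A minimal sketch of that padding (the helper name is mine):

def windows(words, size=2):
    # Pad with START-size..START-1 on the left and STOP1..STOPsize on the
    # right, then slide a (2*size+1)-wide window over the real words.
    padded = (['START-%d' % k for k in range(size, 0, -1)]
              + list(words)
              + ['STOP%d' % k for k in range(1, size + 1)])
    for i in range(size, size + len(words)):
        yield padded[i - size:i + size + 1]

for w in windows(['Apple', 'iPhone', '5'], size=2):
    print(w)
# ['START-2', 'START-1', 'Apple', 'iPhone', '5']
# ['START-1', 'Apple', 'iPhone', '5', 'STOP1']
# ['Apple', 'iPhone', '5', 'STOP1', 'STOP2']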

Now, let's describe the features used for tagging W0:

Features(W-2),Features(W-1),Features(W0),Features(W1),Features(W2)

The features of a token should be the word itself and the tag (for tokens before W0, whose tags are already known). We shall use binary features.

Example - how to build the feature representation:

Step 1 - building the word representation for each token:

Let's take a window size of 1. When classifying a token, we use W-1,W0,W1. Say you build a dictionary and give every word in the corpus a number:

n['Apple'] = 0
n['iPhone 5'] = 1
n['White'] = 2
n['16GB'] = 3
n['Dual-Core'] = 4
n['START-1'] = 5
n['STOP1'] = 6
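
A minimal sketch of how such a dictionary could be built (the helper name is mine; note that the listing above keeps 'iPhone 5' as a single unit, whereas splitting titles on spaces would index 'iPhone' and '5' separately):

def build_word_index(titles):
    # titles: a list of token lists; each new word gets the next free integer.
    index = {}
    for title in titles:
        for word in title:
            index.setdefault(word, len(index))
    for special in ('START-1', 'STOP1'):
        index.setdefault(special, len(index))
    return index

n = build_word_index([['Apple', 'iPhone 5', 'White', '16GB', 'Dual-Core']])
# n == {'Apple': 0, 'iPhone 5': 1, 'White': 2, '16GB': 3,
#       'Dual-Core': 4, 'START-1': 5, 'STOP1': 6}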

Step 2 - a feature for each tag:

We create features for the following tags:

n['B'] = 7 
n['M'] = 8
n['C'] = 9 
n['S'] = 10 
n['NA'] = 11
n['START-1'] = 12
n['STOP1'] = 13

Let's build a feature vector for START-1, Apple, iPhone 5: the first token is a word with a known tag (START-1 will always have the tag START-1). So the features for this token are:

(0,0,0,0,0,0,1,0,0,0,0,0,1,0)

(The features that are 1: having the word START-1 and the tag START-1.)

For the token Apple:

(1,0,0,0,0,0,0)

Note that we use the already-calculated tag feature for every word before W0 (since we have already calculated it). Similarly, the features of the token iPhone 5 are:

(0,1,0,0,0,0,0)
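
A small sketch reproducing such vectors (the dict and function names are mine). Tokens before W0 carry both their word bit and their known-tag bit; W0 and the words after it get only the word part:

word_idx = {'Apple': 0, 'iPhone 5': 1, 'White': 2, '16GB': 3,
            'Dual-Core': 4, 'START-1': 5, 'STOP1': 6}
tag_idx = {'B': 7, 'M': 8, 'C': 9, 'S': 10, 'NA': 11,
           'START-1': 12, 'STOP1': 13}

def token_vector(word, tag=None):
    word_part = [0] * len(word_idx)
    word_part[word_idx[word]] = 1
    if tag is None:                  # W0 and following words: tag unknown yet
        return word_part
    tag_part = [0] * len(tag_idx)
    tag_part[tag_idx[tag] - len(word_idx)] = 1
    return word_part + tag_part      # earlier words: word bit plus tag bit

token_vector('Apple')                # [1, 0, 0, 0, 0, 0, 0]
token_vector('START-1', 'START-1')   # 1s at the START-1 word and tag positions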

Step 3 - concatenating all the features:

Generally, the features for 1-window will be:

word(W-1),tag(W-1),word(W0),word(W1)
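
In practice you rarely build these vectors by hand. A hedged sketch of one way to hand such name=value window features to scikit-learn (the feature names and the tiny three-token training set are mine), using DictVectorizer for the binarization and LinearSVC as the classifier:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One feature dict per token to classify, holding the window contents.
train_feats = [
    {'word(W-1)': 'START-1', 'tag(W-1)': 'START-1',
     'word(W0)': 'Apple', 'word(W1)': 'iPhone'},
    {'word(W-1)': 'Apple', 'tag(W-1)': 'B',
     'word(W0)': 'iPhone', 'word(W1)': '5'},
    {'word(W-1)': 'iPhone', 'tag(W-1)': 'M',
     'word(W0)': '5', 'word(W1)': 'White'},
]
train_tags = ['B', 'M', 'NUMBER']

vec = DictVectorizer()              # each name=value pair becomes one binary column
X = vec.fit_transform(train_feats)  # sparse binary matrix
clf = LinearSVC().fit(X, train_tags)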

Regarding your question: I would use one more tag, NUMBER, so that when you tag the word 5 (since you split the title by space), the W0 features will have a 1 on some number feature, and a 1 on W-1's model tag, in case the previous token was correctly tagged as model.

To sum up, what you should do:

  1. Give a number to each word in the data
  2. Build the feature representation for the train data (using the tags you already assigned manually)
  3. Train a model
  4. Label the test data (see the decoding sketch after this list)
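
Step 4 deserves one extra remark: at test time the previous tag is not given, so you feed back your own predictions, left to right (greedy decoding). A sketch reusing the hypothetical vec and clf from the window-features sketch above:

def tag_title(words, clf, vec):
    tags, prev = [], 'START-1'
    for i, w in enumerate(words):
        feats = {'word(W-1)': words[i - 1] if i > 0 else 'START-1',
                 'tag(W-1)': prev,  # our own prediction for the previous word
                 'word(W0)': w,
                 'word(W1)': words[i + 1] if i + 1 < len(words) else 'STOP1'}
        prev = clf.predict(vec.transform([feats]))[0]
        tags.append(prev)
    return tags

tag_title(['Apple', 'iPhone', '5', 'White', '16GB', 'Dual-Core'], clf, vec)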

Final note - a tip on existing code:

You can find a POS tagger implemented in Python here. It includes an explanation of the problem and the code, and it also does the feature extraction I just described. Additionally, they used a set to represent the features of each word, so the code is much simpler to read.

The data this tagger receives should look like this:

Apple_B iPhone_M 5_NUMBER White_C 16GB_S Dual-Core_NA

The feature extraction is done in this manner (see more at the link above):

def get_features(i, word, context, prev):
    '''Map tokens-in-contexts into a feature representation, implemented as a
    set. If the features change, a new model must be trained.'''
    def add(name, *args):
        features.add('+'.join((name,) + tuple(args)))

    features = set()
    add('bias') # This acts sort of like a prior
    add('i suffix', word[-3:])
    add('i-1 tag', prev)
    add('i word', context[i])
    add('i-1 word', context[i-1])
    add('i+1 word', context[i+1])
    return features

For the example above:

context = ["Apple","iPhone","5","White","16GB","Dual-Core"]
prev = "B"
i = 1
word = "iPhone"

Generally, word is the str of the current word, context is the title split into a list, and prev is the tag you received for the previous word.

I used this code in the past; it works fast with great results. Hope it's clear, have fun tagging!
