Different types of features to train Naive Bayes in Python Pandas


Problem description

I would like to use a number of features to train a Naive Bayes classifier to classify 'A' or 'non-A'.

I have three features with different value types: 1) total_length - a positive integer, 2) vowel_ratio - a decimal/fraction, 3) twoLetters_lastName - an array containing multiple two-letter strings.

# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8', 
    low_memory=False)
df = DataFrame(data)

# Randomize records
df = df.reindex(np.random.permutation(df.index))

# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y

# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]

# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)

df_Y is as below:

[[u'non-A']
 [u'A']
 [u'non-A']
 ..., 
 [u'A']
 [u'non-A']
 [u'non-A']]

df_X is below:

[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
 [17L 0.41176470600000004
  u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
 [11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
 [11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
 [15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
 [13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
 [12L 0.41666666700000005
  u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
 [15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
 [15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
 [11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]

I am getting this error:

E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Traceback (most recent call last):
  File "C:werwer\wer\wer.py", line 32, in <module>
    clf.fit(df_X, df_Y)
  File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
    self.theta_[i, :] = np.mean(Xi, axis=0)
  File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
    out=out, keepdims=keepdims)
  File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
    ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'

My understanding is that I need to convert the features into one numpy array as a feature vector, but I'm not sure I am preparing this X vector correctly, since it contains very different value types.

Recommended answer

Related question: Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.

  1. We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel-ratio. For instance, you can basically bin the values you see in each feature to one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no real easy way in sk-learn as far as I know, but it should be pretty straightforward to do it yourself. The only thing that you would want to change is that you would want to use MultinomialNB instead of GaussianNB because you'll be dealing with features that would be better described by multinomial distributions rather than gaussian ones.
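A minimal sketch of option 1, using made-up data and column names: bin each continuous feature into 5 percentile-based categories with `pd.qcut`, then fit MultinomialNB on the integer bin codes.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'total_length': rng.randint(5, 20, size=100),
    'vowel_ratio': rng.uniform(0.1, 0.6, size=100),
})
y = rng.choice(['A', 'non-A'], size=100)

# labels=False returns the bin index (0..4) instead of an interval label;
# duplicates='drop' guards against repeated quantile edges on integer data.
binned = df.apply(lambda col: pd.qcut(col, 5, labels=False, duplicates='drop'))

clf = MultinomialNB()
clf.fit(binned.values, y)
print(clf.predict(binned.values[:3]))
```

The bin codes are non-negative integers, which is exactly the kind of input MultinomialNB expects.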

  2. We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this to be the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks to me like you want to extract the incidence of different two-letter last names.

Normally I would ask whether or not you have all the last names in your dataset, but since each one is only two letters, we can just store all the possible two-letter names (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two-letter last names in your data, you can directly use this to turn a row in your twoLetters_lastName column into an N-dimensional vector that records the number of occurrences of each unique last name in your row. Then just combine this new vector with your other two features into a numpy array.
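A sketch of that step, assuming the bigram column has already been parsed into real Python lists (the sample rows are made up). Passing `analyzer=lambda doc: doc` tells CountVectorizer that each "document" is already tokenized, so no string splitting is applied.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

bigram_lists = [
    ['ke', 'el', 'll', 'ly'],
    ['jo', 'on', 'ne', 'es'],
    ['ma', 'ah', 'he', 'er'],
]
numeric = np.array([[9, 0.222], [13, 0.308], [11, 0.455]])

# Each row becomes a count vector over the unique bigrams seen in the data.
vec = CountVectorizer(analyzer=lambda doc: doc)
counts = vec.fit_transform(bigram_lists).toarray()

# Stack the bigram counts next to the two numeric features.
X = np.hstack([numeric, counts])
print(X.shape)  # (3, 14): 2 numeric columns + 12 unique bigrams
```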

In the case you do not have every possible combination of two letters (including accented ones), you should consider generating that list and pass it in as the 'vocabulary' for the CountVectorizer. This is so that your classifier knows how to handle all possible last names. It's not the end of the world if you don't handle all cases, but any new unseen two letter pairs will be ignored in this scheme.
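A sketch of fixing the vocabulary in advance: enumerate every two-letter combination so the vectorizer has a stable column for each possible bigram. This is restricted to ASCII lowercase here; you would extend the alphabet with the accented characters your data contains.

```python
from itertools import product
from string import ascii_lowercase

from sklearn.feature_extraction.text import CountVectorizer

# All 676 two-letter combinations of a-z.
vocab = [a + b for a, b in product(ascii_lowercase, repeat=2)]

# With a fixed vocabulary, transform() can be called directly (no fit);
# tokens outside the vocabulary are silently counted as zero.
vec = CountVectorizer(analyzer=lambda doc: doc, vocabulary=vocab)
counts = vec.transform([['ke', 'el'], ['ly', 'ly']]).toarray()
print(counts.shape)  # (2, 676)
```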

Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
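The df_X dump above suggests the bigram column is stored as the *string* representation of a list (e.g. "[u'ke', u'el', ...]") rather than a real list. A sketch of recovering actual lists with `ast.literal_eval` before vectorizing (the sample string is illustrative):

```python
import ast

# One cell from the twoLetters_lastName column, stored as a string.
raw = "[u'ke', u'el', u'll', u'ly']"
parsed = ast.literal_eval(raw)
print(parsed)  # ['ke', 'el', 'll', 'ly']
```

Applied column-wise, something like `df['twoLetters_lastName'].apply(ast.literal_eval)` would convert the whole column in one pass.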

You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look at other representation methods, like binarization and one-hot encoding. There are many ways to solve this problem; it mostly depends on your specific problem/needs.
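A tiny sketch of the one-hot representation mentioned above, using `pandas.get_dummies` on a made-up categorical column:

```python
import pandas as pd

# One indicator column per category value.
s = pd.Series(['small', 'high', 'small'])
onehot = pd.get_dummies(s)
print(onehot.shape)  # (3, 2)
```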


After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like

[[u'non-A']
 [u'A']
 [u'non-A']
 ..., 
 [u'A']
 [u'non-A']
 [u'non-A']]

Basically, it's expecting this to be in a flat list, rather than as a column vector (a list of one-dimensional lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the error mentioned).
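A minimal sketch of that fix: `ravel()` flattens the (n_samples, 1) column vector into the (n_samples,) shape sklearn expects.

```python
import numpy as np

df_Y = np.array([['non-A'], ['A'], ['non-A']])  # shape (3, 1), as printed above
y = df_Y.ravel()                                # shape (3,)
print(y.shape)
```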
