Combining bag of words and other features in one model using sklearn and pandas


Question

I am trying to model the score that a post receives, based on both the text of the post, and other features (time of day, length of post, etc.)

I am wondering how to best combine these different types of features into one model. Right now, I have something like the following (stolen from here and here).

import pandas as pd
...

def features(p):
    terms = vectorizer(p[0])
    d = {'feature_1': p[1], 'feature_2': p[2]}
    for t in terms:
        d[t] = d.get(t, 0) + 1
    return d

posts = pd.read_csv('path/to/csv')

# Create vectorizer for function to use
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()
y = posts["score"].values.astype(np.float32) 
vect = DictVectorizer()

# This is the part I want to fix
temp = zip(list(posts.message), list(posts.feature_1), list(posts.feature_2))
tokenized = map(lambda x: features(x), temp)
X = vect.fit_transform(tokenized)

It seems very silly to extract all of the features I want out of the pandas dataframe, just to zip them all back together. Is there a better way of doing this step?

The CSV looks like this:

ID,message,feature_1,feature_2
1,'This is the text',4,7
2,'This is more text',3,2
...

Recommended answer

You could do everything with your map and lambda:

tokenized = map(lambda msg, ft1, ft2: features([msg, ft1, ft2]), posts.message, posts.feature_1, posts.feature_2)

This saves doing your interim temp step and iterates through the 3 columns.
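As a rough, self-contained sketch of the map approach, the snippet below rebuilds the question's `features` helper on a small inline DataFrame. The DataFrame stands in for the CSV and is an assumption for illustration, as is the name `tokenize`. Note that `build_tokenizer()` returns only the tokenizer, so tokens are neither lowercased nor combined into bigrams here.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the CSV from the question
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text", "More random text"],
    "feature_1": [4, 3, 3],
    "feature_2": [7, 2, 2],
})

# build_tokenizer() only splits text into tokens (no lowercasing, no n-grams)
tokenize = CountVectorizer(binary=True, ngram_range=(1, 2)).build_tokenizer()

def features(p):
    """Merge the numeric features and the token counts of one post into a dict."""
    d = {"feature_1": p[1], "feature_2": p[2]}
    for t in tokenize(p[0]):
        d[t] = d.get(t, 0) + 1
    return d

# map() walks the three columns in parallel, so no intermediate zip is needed
tokenized = list(map(lambda msg, ft1, ft2: features([msg, ft1, ft2]),
                     posts.message, posts.feature_1, posts.feature_2))

X = DictVectorizer().fit_transform(tokenized)  # sparse matrix, one row per post
```

One caveat with this dict-based layout: a token that happens to be spelled `feature_1` or `feature_2` would collide with the numeric feature of the same name.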

Another solution would be to convert the messages into their CountVectorizer sparse matrix and join this matrix with the feature values from the posts dataframe (this skips having to construct a dict and produces a sparse matrix similar to what you would get with DictVectorizer):

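import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv('post.csv')

# Bag-of-words vectorizer (binary unigram and bigram indicators)
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
y = posts["score"].values.astype(np.float32)

# Stack the sparse bag-of-words matrix next to the numeric feature columns
X = sparse.hstack((vectorizer.fit_transform(posts.message), posts[['feature_1', 'feature_2']].values), format='csr')
# scikit-learn >= 1.0: get_feature_names_out(); older releases used get_feature_names()
X_columns = vectorizer.get_feature_names_out().tolist() + posts[['feature_1', 'feature_2']].columns.tolist()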


posts
Out[38]: 
   ID              message  feature_1  feature_2  score
0   1   'This is the text'          4          7     10
1   2  'This is more text'          3          2      9
2   3   'More random text'          3          2      9

X_columns
Out[39]: 
[u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is',
 'feature_1',
 'feature_2']

X.toarray()
Out[40]: 
array([[1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 4, 7],
       [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 3, 2],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 3, 2]])
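Since the goal in the question is to model the score, here is a minimal sketch of how a stacked sparse matrix like the one above might be passed to a regressor. The inline DataFrame mirrors the sample data, and the choice of `Ridge` with `alpha=1.0` is purely an assumption for illustration; the answer does not prescribe a model.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Inline copy of the sample data shown above (stand-in for the CSV)
posts = pd.DataFrame({
    "message": ["This is the text", "This is more text", "More random text"],
    "feature_1": [4, 3, 3],
    "feature_2": [7, 2, 2],
    "score": [10, 9, 9],
})

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
bow = vectorizer.fit_transform(posts["message"])                # sparse bag of words
extra = sparse.csr_matrix(posts[["feature_1", "feature_2"]].to_numpy())

# Column-wise stack: text columns first, then the two numeric features
X = sparse.hstack((bow, extra), format="csr")
y = posts["score"].to_numpy(dtype=np.float32)

# Hypothetical model choice: Ridge accepts sparse input directly
model = Ridge(alpha=1.0).fit(X, y)
predictions = model.predict(X)
```

With real data the numeric columns would usually be scaled before fitting, since raw counts and features such as post length can live on very different scales.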

Additionally, sklearn-pandas has DataFrameMapper, which does what you're looking for too:

from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    (['feature_1', 'feature_2'], None),
    ('message', CountVectorizer(binary=True, ngram_range=(1, 2)))
])
X = mapper.fit_transform(posts)

X
Out[71]: 
array([[4, 7, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [3, 2, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1],
       [3, 2, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]])

Note: X is not sparse when using this last method.

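# scikit-learn >= 1.0: get_feature_names_out(); older releases used get_feature_names()
X_columns = mapper.features[0][0] + mapper.features[1][1].get_feature_names_out().tolist()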

X_columns
Out[76]: 
['feature_1',
 'feature_2',
 u'is',
 u'is more',
 u'is the',
 u'more',
 u'more random',
 u'more text',
 u'random',
 u'random text',
 u'text',
 u'the',
 u'the text',
 u'this',
 u'this is']
