Use FeatureUnion in scikit-learn to combine two pandas columns for TF-IDF


Question

While using this as a model for spam classification, I'd like to add the Subject as an additional feature alongside the body.

I have all of my features in a pandas DataFrame. For example, the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].

I receive the following error: TypeError: 'FeatureUnion' object is not iterable

How can I use both df['Subject'] and df['body_text'] as features while still running them through the pipeline?

from sklearn.pipeline import FeatureUnion
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
('count_vectorizer',  CountVectorizer(ngram_range=(1, 2))),
('tfidf_transformer',  TfidfTransformer()),
('classifier',  MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[train_indices]['ham/spam'].values

    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

Recommended answer

FeatureUnion was not meant to be used that way. It takes two or more feature extractors / vectorizers and applies each of them to the input. It does not take data in its constructor, which is what the code in the question attempts.
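To make the constructor contract concrete, here is a minimal sketch. The toy documents and the step names "word_counts" and "char_tfidf" are invented for illustration; the point is only that FeatureUnion receives (name, transformer) pairs and that the data arrives later, through fit_transform.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy documents, invented for illustration only.
docs = ["win money now", "meeting at noon", "win a free prize now"]

# FeatureUnion is built from (name, transformer) pairs, not from data.
union = FeatureUnion([
    ("word_counts", CountVectorizer()),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# The data is passed to fit_transform. Every transformer sees the same input,
# and their outputs are concatenated column-wise.
features = union.fit_transform(docs)

# Width of the result is the sum of the two vocabularies.
n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)
print(features.shape, n_word, n_char)
```

Note that both transformers here receive the same list of strings. Feeding different DataFrame columns to different transformers needs an extra column-selection step in front of each vectorizer.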

CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the strings, which passes the text from both columns to the same CountVectorizer.

combined_2 = df['Subject'] + ' '  + df['body_text']
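As a rough end-to-end sketch of the concatenation approach, the snippet below feeds the combined Series through the same three pipeline steps used in the question. The four-row DataFrame is made up for illustration, and the fillna("") calls are an added assumption: they guard against missing values, since a string plus NaN would turn the whole row into NaN.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow", "Free prize inside", "Project update"],
    "body_text": ["claim your free money", "are you free at noon",
                  "click to win a prize", "the report is attached"],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# One string per row: subject and body separated by a space.
combined = df["Subject"].fillna("") + " " + df["body_text"].fillna("")

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

pipeline.fit(combined, df["ham/spam"])
predictions = pipeline.predict(combined)
print(predictions)
```

With this approach a word is counted the same way whether it appears in the subject or in the body, because both share one vocabulary.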

An alternative is to run CountVectorizer, and optionally TfidfTransformer, on each column separately and then stack the results.

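import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Fit one vectorizer per column so each column gets its own vocabulary.
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# Stack the two sparse matrices side by side.
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')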
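One detail the stacking approach leaves implicit is how to treat held-out data: the vectorizers should be fitted on the training rows only and then reused, via transform, on the test rows so that the columns line up. The sketch below illustrates this with made-up train/test frames. It uses TfidfVectorizer, which is equivalent to CountVectorizer followed by TfidfTransformer, as a stand-in for the optional TF-IDF step.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up train and test splits, for illustration only.
train = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow", "Free prize inside"],
    "body_text": ["claim your free money", "are you free at noon", "click to win a prize"],
})
test = pd.DataFrame({
    "Subject": ["Free cash prize"],
    "body_text": ["win money at noon"],
})

subject_vectorizer = TfidfVectorizer()
body_vectorizer = TfidfVectorizer()

# Learn the vocabularies from the training rows only.
X_train = sp.hstack([
    subject_vectorizer.fit_transform(train["Subject"]),
    body_vectorizer.fit_transform(train["body_text"]),
], format="csr")

# Reuse the fitted vectorizers on the test rows: transform, not fit_transform.
X_test = sp.hstack([
    subject_vectorizer.transform(test["Subject"]),
    body_vectorizer.transform(test["body_text"]),
], format="csr")

print(X_train.shape, X_test.shape)
```

Doing this by hand inside a cross-validation loop is easy to get wrong, which is part of the appeal of wrapping the per-column steps in a pipeline.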

A third option is to implement your own transformer that extracts a single DataFrame column.

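from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator supplies get_params/set_params, which cloning (for example
# during cross-validation) relies on.
class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):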

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]

In that case you can use FeatureUnion on two pipelines, each consisting of the custom transformer followed by a CountVectorizer.

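from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

# Each pipeline selects one column, then vectorizes it.
subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)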

This feature union of pipelines takes the DataFrame, and each pipeline processes its own column. The result is the column-wise concatenation of the term-count matrices from the two columns.

sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be added as the first step in a larger pipeline.
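A minimal sketch of such a larger pipeline follows, assuming the TF-IDF and Naive Bayes steps from the question are placed after the union. The class name ColumnExtractor is a hypothetical stand-in for the extractor described above, the six-row DataFrame is invented, and cross_val_score from sklearn.model_selection is used in place of a hand-written fold loop.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline, make_union


class ColumnExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical helper: select one DataFrame column for a text vectorizer."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.column]


# Made-up data using the column names from the question.
df = pd.DataFrame({
    "Subject": ["Win cash now", "Lunch tomorrow", "Free prize inside",
                "Project update", "Claim your reward", "Team meeting notes"],
    "body_text": ["claim your free money", "are you free at noon",
                  "click to win a prize", "the report is attached",
                  "you have won free cash", "notes from the meeting attached"],
    "ham/spam": ["spam", "ham", "spam", "ham", "spam", "ham"],
})
X = df[["Subject", "body_text"]]
y = df["ham/spam"]

# One sub-pipeline per column; the union stacks their count matrices.
feature_union = make_union(
    make_pipeline(ColumnExtractor("Subject"), CountVectorizer(ngram_range=(1, 2))),
    make_pipeline(ColumnExtractor("body_text"), CountVectorizer(ngram_range=(1, 2))),
)

pipeline = Pipeline([
    ("features", feature_union),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# The whole pipeline is refitted on each training fold, so the vocabularies
# are learned only from that fold's training rows.
scores = cross_val_score(pipeline, X, y, cv=3)

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(scores, predictions)
```

Here a single TfidfTransformer reweights the stacked matrix as a whole. Placing a TfidfTransformer inside each sub-pipeline instead would normalize the subject and body features independently; which works better is something to check empirically.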
