How to make a pipeline for multiple DataFrame columns?


Question

I have a DataFrame that can be simplified to this:

import pandas as pd

df = pd.DataFrame([
    {'title': 'batman',
     'text': 'man bat man bat',
     'url': 'batman.com',
     'label': 1},
    {'title': 'spiderman',
     'text': 'spiderman man spider',
     'url': 'spiderman.com',
     'label': 1},
    {'title': 'doctor evil',
     'text': 'a super evil doctor',
     'url': 'evilempyre.com',
     'label': 0},
])

I want to try different feature extraction methods: TF-IDF, word2vec, CountVectorizer with different n-gram settings, and so on. But I want to try them in different combinations: one feature set would contain the 'text' data transformed with TF-IDF and 'url' with CountVectorizer; a second would have the 'text' data converted by word2vec and 'url' by TF-IDF; and so on. In the end, of course, I want to compare the different preprocessing strategies and choose the best one.

My questions:

  1. Is there a way to do this using standard sklearn tools like Pipeline?

  2. Does my idea make sense? Maybe there are good approaches to handling text data spread across many DataFrame columns that I am missing?

Thanks a lot!

Recommended answer

@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First off, you need to define the functions you wrap in FunctionTransformer so that they handle your input data and return it in the right form. In this case I assume you want to pass in the whole DataFrame but get back a properly shaped array for use downstream. Therefore I would propose passing just the DataFrame and accessing columns by name. Like so:

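from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def text(X):
    # Return the 'text' column as a 1-D array of strings
    return X['text'].values

def title(X):
    # Return the 'title' column as a 1-D array of strings
    return X['title'].values

# validate=False passes the DataFrame to the function as-is
pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])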
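
As a quick sanity check of the column-selector idea, the sketch below confirms that a selector wrapped in FunctionTransformer yields a 1-D array of strings, which is the input shape the text vectorizers expect. The two-row frame `toy` and the helper name `select_text` are made up for this illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Illustrative two-row frame; the values are made up for this check.
toy = pd.DataFrame({
    'title': ['batman', 'doctor evil'],
    'text': ['man bat man bat', 'a super evil doctor'],
})

def select_text(X):
    # Return the 'text' column as a 1-D NumPy array of strings.
    return X['text'].values

selector = FunctionTransformer(select_text, validate=False)
raw = selector.fit_transform(toy)

# Chaining a vectorizer after the selector gives one row per document.
pipe = Pipeline([('col_text', selector), ('cv', CountVectorizer())])
counts = pipe.fit_transform(toy)
```

Here `raw` is the plain column of strings and `counts` is a sparse document-term matrix, so each per-column pipeline can be combined with others in a FeatureUnion.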

Now, to test the variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.

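from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

# Split the rows first so the vectorizers are fitted on training data only.
# The three-row example frame is too small for a meaningful split (the
# training part may contain a single class); use the real data here.
df_train, df_test = train_test_split(df)

best_est = None
best_score = 0
for name1, tran1 in transformers:
    for name2, tran2 in transformers:
        # clone() gives each branch its own unfitted copy of the vectorizer
        pipe1 = Pipeline(pipe_text.steps + [(name1, clone(tran1))])
        pipe2 = Pipeline(pipe_title.steps + [(name2, clone(tran2))])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X_train = union.fit_transform(df_train)
        X_test = union.transform(df_test)
        for clf in clfs:
            # Fit a fresh copy so later iterations do not overwrite the best model
            est = clone(clf).fit(X_train, df_train['label'])
            score = est.score(X_test, df_test['label'])
            if score > best_score:
                best_score = score
                best_est = est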

This is a simple example, but you can see how you could plug in any variety of transformers and classifiers in this way.
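
Since the loop is compared to a grid search, here is one possible way to express the same idea with GridSearchCV, swapping whole pipeline steps through the parameter grid. This is a sketch and not part of the original answer: the eight-row frame `toy`, the helpers `get_text` and `get_title`, and the step names `features`, `col`, `vec` and `clf` are all made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up toy data: eight rows so that 2-fold CV sees both classes in each fold.
toy = pd.DataFrame({
    'title': ['batman', 'spiderman', 'superman', 'ironman',
              'doctor evil', 'joker', 'lex luthor', 'green goblin'],
    'text': ['man bat hero', 'spider man hero', 'super man hero',
             'iron man hero', 'evil doctor villain', 'evil clown villain',
             'evil genius villain', 'evil goblin villain'],
    'label': [1, 1, 1, 1, 0, 0, 0, 0],
})

def get_text(X):
    return X['text'].values

def get_title(X):
    return X['title'].values

# One branch per column; the 'vec' step is a placeholder replaced by the grid.
union = FeatureUnion([
    ('text', Pipeline([('col', FunctionTransformer(get_text, validate=False)),
                       ('vec', TfidfVectorizer())])),
    ('title', Pipeline([('col', FunctionTransformer(get_title, validate=False)),
                        ('vec', TfidfVectorizer())])),
])
model = Pipeline([('features', union), ('clf', LogisticRegression())])

# Each entry replaces an entire step, giving 2 x 2 x 2 = 8 candidates.
param_grid = {
    'features__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'features__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}

search = GridSearchCV(model, param_grid, cv=2)
search.fit(toy, toy['label'])
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is the full pipeline refitted on all rows. Because the vectorizers sit inside the cross-validated pipeline, they are refitted on each training fold.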
