How to make a pipeline for multiple dataframe columns?

Question

I have a DataFrame that can be simplified to this:

import pandas as pd

df = pd.DataFrame([
    {'title': 'batman',
     'text': 'man bat man bat',
     'url': 'batman.com',
     'label': 1},
    {'title': 'spiderman',
     'text': 'spiderman man spider',
     'url': 'spiderman.com',
     'label': 1},
    {'title': 'doctor evil',
     'text': 'a super evil doctor',
     'url': 'evilempyre.com',
     'label': 0},
])

I want to try different feature extraction methods: TF-IDF, word2vec, CountVectorizer with different n-gram settings, and so on. I also want to try them in different combinations: one feature set would contain the 'text' data transformed with TF-IDF and 'url' with CountVectorizer, a second would have 'text' converted by word2vec and 'url' by TF-IDF, and so on. In the end, of course, I want to compare the different preprocessing strategies and choose the best one.

My questions:

  1. Is there a way to do this using standard sklearn tools like Pipeline?

  2. Does my idea make sense? Maybe there are good ways to handle text data spread over several DataFrame columns that I am missing?

Many thanks!

Recommended answer

@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.

First off, you need to define your FunctionTransformer functions so that they handle your input data and return it properly. In this case I assume you just want to pass the DataFrame in, while making sure you get back a properly shaped array for use downstream. I would therefore propose passing the whole DataFrame and accessing the columns by name, like so:

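from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Each selector pulls one column out of the DataFrame as a 1-D array of strings,
# which is the input shape the text vectorizers expect.
def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])

pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])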
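To check that a selector returns the shape a vectorizer needs, it can help to run one branch on its own. The following is a minimal sketch, assuming a recent scikit-learn and a cut-down copy of the question's DataFrame; the names `select_text` and `text_tfidf` are only illustrative and do not come from the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Cut-down copy of the question's data (only the columns used here).
df = pd.DataFrame({
    "title": ["batman", "spiderman", "doctor evil"],
    "text": ["man bat man bat", "spiderman man spider", "a super evil doctor"],
})

def text(X):
    # Return the raw strings as a 1-D array, one document per row.
    return X["text"].values

# Illustrative name: the selector on its own.
select_text = FunctionTransformer(text, validate=False)
selected = select_text.fit_transform(df)
print(selected.shape)  # 1-D, one entry per DataFrame row

# Illustrative name: selector followed by a vectorizer.
text_tfidf = Pipeline([("col_text", select_text), ("tfidf", TfidfVectorizer())])
features = text_tfidf.fit_transform(df)
print(features.shape)  # (rows, vocabulary size) sparse matrix
```

If the selector returned a 2-D array or a DataFrame instead, the vectorizer would not see one document per row, so checking the shape of this intermediate result is a cheap sanity test.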

Now, to test the variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.

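from copy import deepcopy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

best_clf = None
best_score = 0
for tran1 in transformers:
    for tran2 in transformers:
        # Append one vectorizer to each column selector, then join the two branches.
        pipe1 = Pipeline(pipe_text.steps + [tran1])
        pipe2 = Pipeline(pipe_title.steps + [tran2])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        # Note: the vectorizers are fit on all rows here, before the split.
        X = union.fit_transform(df)
        X_train, X_test, y_train, y_test = train_test_split(X, df.label)
        for clf in clfs:
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            if score > best_score:
                best_score = score
                # Copy the fitted model, since clf is refit in later iterations.
                best_clf = deepcopy(clf)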

This is a simple example, but you can see how any variety of transformations and classifiers could be plugged in this way.
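Since the loop is described as being much like a grid search, the same comparison can also be handed to GridSearchCV, which accepts whole estimators as parameter values and refits the vectorizers inside each training fold. The following is a sketch under stated assumptions rather than part of the original answer: the fourth data row, the repetition of the rows, the `branch` helper, and the step names `col`, `vec`, `features` and `clf` are all made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: rows are repeated so every CV fold contains both labels.
rows = [
    {"title": "batman", "text": "man bat man bat", "label": 1},
    {"title": "spiderman", "text": "spiderman man spider", "label": 1},
    {"title": "doctor evil", "text": "a super evil doctor", "label": 0},
    {"title": "evil twin", "text": "an evil twin doctor", "label": 0},
]
df = pd.DataFrame(rows * 3)

def text(X):
    return X["text"].values

def title(X):
    return X["title"].values

def branch(selector):
    # Hypothetical helper: a column selector followed by a placeholder
    # vectorizer that the grid search swaps out.
    return Pipeline([
        ("col", FunctionTransformer(selector, validate=False)),
        ("vec", TfidfVectorizer()),
    ])

model = Pipeline([
    ("features", FeatureUnion([("text", branch(text)), ("title", branch(title))])),
    ("clf", LogisticRegression()),
])

# Each entry replaces a whole pipeline step with one of the listed estimators.
param_grid = {
    "features__text__vec": [TfidfVectorizer(), CountVectorizer()],
    "features__title__vec": [TfidfVectorizer(), CountVectorizer(ngram_range=(1, 2))],
    "clf": [LogisticRegression(), RidgeClassifier()],
}

search = GridSearchCV(model, param_grid, cv=3)
search.fit(df[["title", "text"]], df["label"])
print(search.best_params_)
print(search.best_score_)
```

Keeping the feature extraction inside the pipeline means each candidate is scored on folds its vectorizers never saw during fitting, which the manual loop does not guarantee. A word2vec branch would need a custom transformer with `fit` and `transform` methods, which is not shown here.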
