自定义FeatureUnion无法正常工作? [英] Custom FeatureUnion won't work?

查看:93
本文介绍了自定义FeatureUnion无法正常工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试修改示例以使用熊猫数据框而不是测试数据集.我无法这样做,因为ItemSelector似乎无法识别列名.

I'm trying to modify this example to use a Pandas dataframe instead of the test datasets. I am not able to do so, as ItemSelector does not seem to recognise the column name.

请注意数据框df_resolved.columns返回的列:

Please do note the columns of the dataframe df_resolved.columns returns:

Index(['u_category', ... ... 'resolution_time', 'rawtext'],
      dtype='object')

所以我显然在我的数据框中确实有这个.

So I obviously do have this in my dataframe.

但是,当我尝试运行解决方案时,出现错误

However, when I try to run the solution, I get the error

"ValueError:没有名称为u_category的字段"

"ValueError: no field of name u_category"

此外,我似乎无法修改代码以支持选择ItemSelector中的多列,因此在此解决方案中,我必须将转换器与每一列分开应用.

Also, I don't seem to be able to modify the code to support choosing multiple columns in the ItemSelector, so in this solution, I'd have to apply the transformers separately with each column.

我的代码是:

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]


class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub

        return features


pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[

            # Pipeline for pulling features from the post's subject line
            ('rawtext', Pipeline([
                ('selector', ItemSelector(key='u_category')),
                ('labelenc', preprocessing.LabelEncoder()),
            ])),
            # Pipeline for standard bag-of-words model for body
            ('features', Pipeline([
                ('selector', ItemSelector(key='rawtext')),
                ('tfidf', TfidfVectorizer(max_df=0.5, min_df=1, 
                                          stop_words='english', 
                                          token_pattern=u'(?ui)\\b\\w*[a-z]{2,}\\w*\\b')),
            ])),
        ],

        # weight components in FeatureUnion
        transformer_weights={
            'rawtext': 1.0,
            'features': 1.0,
        },
    )),

    # Use a SVC classifier on the combined features
    ('linear_svc', LinearSVC(penalty="l2")),
])

# limit the list of categories to make running this example faster.
X_train, X_test, y_train, y_test = train_test_split(df_resolved.ix[:, (df_resolved.columns != 'assignment_group.name')], df_resolved['assignment_group.name'], test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))

如何修改此代码以使其与数据框一起正常工作,并可能支持一次将转换器应用于多个列?

How can I modify this code to work properly with my dataframe, and possibly support applying a transformer to multiple columns at once?

如果我将ItemSelector取出,它似乎可以工作.这样就可以了:

If I take the ItemSelector out, it seems to work. So this works:

ds = ItemSelector(key='u_category')
ds.fit(df_resolved)

labelenc = preprocessing.LabelEncoder()
labelenc_transformed = labelenc.fit_transform(ds.transform(df_resolved))

完整堆栈跟踪:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-a4ba29c137ec> in <module>()
    136 
    137 
--> 138 pipeline.fit(X_train, y_train)
    139 #y = pipeline.predict(X_test)
    140 #print(classification_report(y, test.target))

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    732             delayed(_fit_transform_one)(trans, name, weight, X, y,
    733                                         **fit_params)
--> 734             for name, trans, weight in self._iter())
    735 
    736         if not result:

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    606                 return False
    607             else:
--> 608                 self._dispatch(tasks)
    609                 return True
    610 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    569         dispatch_timestamp = time.time()
    570         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571         job = self._backend.apply_async(batch, callback=cb)
    572         self._jobs.append(job)
    573 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    107     def apply_async(self, func, callback=None):
    108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
    110         if callback:
    111             callback(result)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    324         # Don't delay the application, to avoid keeping the input
    325         # arguments in memory
--> 326         self.results = batch()
    327 
    328     def get(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, name, weight, X, y, **fit_params)
    575                        **fit_params):
    576     if hasattr(transformer, 'fit_transform'):
--> 577         res = transformer.fit_transform(X, y, **fit_params)
    578     else:
    579         res = transformer.fit(X, y, **fit_params).transform(X)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    299         """
    300         last_step = self._final_estimator
--> 301         Xt, fit_params = self._fit(X, y, **fit_params)
    302         if hasattr(last_step, 'fit_transform'):
    303             return last_step.fit_transform(Xt, y, **fit_params)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    495         else:
    496             # fit method of arity 2 (supervised transformation)
--> 497             return self.fit(X, y, **fit_params).transform(X)
    498 
    499 

<ipython-input-93-a4ba29c137ec> in transform(self, data_dict)
     55 
     56     def transform(self, data_dict):
---> 57         return data_dict[self.key]
     58 
     59 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/numpy/core/records.py in __getitem__(self, indx)
    497 
    498     def __getitem__(self, indx):
--> 499         obj = super(recarray, self).__getitem__(indx)
    500 
    501         # copy behavior of getattr, except that here

ValueError: no field of name u_category

更新:

即使我使用数据帧(否train_test_split),问题仍然存在:

Even if I use dataframes (NO train_test_split), the issue persists:

更新2: 好,所以我删除了SubjectBodyExtractor,因为我不需要它了.现在ValueError: no field of name u_category不见了,但是我有一个新错误:TypeError: fit_transform() takes 2 positional arguments but 3 were given.

UPDATE 2: OK so I removed the SubjectBodyExtractor, since I won't need that. Now the ValueError: no field of name u_category is gone, but I have a new error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.

堆栈跟踪:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-110-292294015e44> in <module>()
    129 
    130 
--> 131 pipeline.fit(X_train.ix[:, (X_test.columns != 'assignment_group.name')], X_test['assignment_group.name'])
    132 #y = pipeline.predict(X_test)
    133 #print(classification_report(y, test.target))

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    732             delayed(_fit_transform_one)(trans, name, weight, X, y,
    733                                         **fit_params)
--> 734             for name, trans, weight in self._iter())
    735 
    736         if not result:

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    606                 return False
    607             else:
--> 608                 self._dispatch(tasks)
    609                 return True
    610 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    569         dispatch_timestamp = time.time()
    570         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571         job = self._backend.apply_async(batch, callback=cb)
    572         self._jobs.append(job)
    573 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    107     def apply_async(self, func, callback=None):
    108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
    110         if callback:
    111             callback(result)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    324         # Don't delay the application, to avoid keeping the input
    325         # arguments in memory
--> 326         self.results = batch()
    327 
    328     def get(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132 
    133     def __len__(self):

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, name, weight, X, y, **fit_params)
    575                        **fit_params):
    576     if hasattr(transformer, 'fit_transform'):
--> 577         res = transformer.fit_transform(X, y, **fit_params)
    578     else:
    579         res = transformer.fit(X, y, **fit_params).transform(X)

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    301         Xt, fit_params = self._fit(X, y, **fit_params)
    302         if hasattr(last_step, 'fit_transform'):
--> 303             return last_step.fit_transform(Xt, y, **fit_params)
    304         elif last_step is None:
    305             return Xt

TypeError: fit_transform() takes 2 positional arguments but 3 were given

推荐答案

是的,那是因为LabelEncoder仅需要单个数组y,而FeatureUnion将尝试向其发送X和y.

Yes, thats because LabelEncoder only requires a single array y whereas FeatureUnion will try sending X and y both to it.

查看此内容: https://github.com/scikit-learn/scikit-learn/issues/3956

您可以为此使用简单的解决方法:

You can use a simple workaround for this:

定义这样的自定义labelEncoder:

Define a custom labelEncoder like this:

class MyLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.le = LabelEncoder()

    def fit(self, x, y=None):
        return self.le.fit(x)

    def transform(self, x, y=None):
        return self.le.transform(x).reshape(-1,1)

    def fit_transform(self, x, y=None):
        self.fit(x)
        return self.transform(x)

然后在管道中执行以下操作:

And in the pipeline, do this:

....
....
                ('selector', ItemSelector(key='u_category')),
                ('labelenc', MyLabelEncoder()),

请注意trasform()方法中的reshape(-1,1).那是因为FeatureUnion仅适用于二维数据. FeatureUnion内部的所有单个转换器都只能返回二维数据.

Please note the reshape(-1,1) in the trasform() method. Thats because FeatureUnion only works with 2-d data. All the individual transformers inside the FeatureUnion should only return 2-d data.

这篇关于自定义FeatureUnion无法正常工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆