How to fit different inputs into an sklearn Pipeline?

Problem description

I am using Pipeline from sklearn to classify text.

In this example Pipeline, a TF-IDF vectorizer and some custom features are wrapped in a FeatureUnion, followed by a classifier, as the Pipeline steps. I then fit the training data and predict:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
features.append(('ngram', countVecWord))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)

pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.

The above code works just fine, but there is a twist: I want to do part-of-speech (POS) tagging on the text and use a different vectorizer on the tagged text.

X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X) 
# X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)

# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=4000)
# new POS Vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features=2000)

features.append(('ngram', countVecWord))
features.append(('pos_ngram', countVecPOS))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)

pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])

# How do I fit both X and X_tagged here?
# How can each vectorizer get either X or X_tagged?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.

How do I properly fit this kind of data? How can the two vectorizers differentiate between raw text and POS text? What are my options?

I also have custom features. Some of them take the raw text and others take the POS text.

EDIT: Added MeasureFeatures()

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import numpy as np

# type_token_ratio() and count_nouns() are helper functions defined elsewhere.

class MeasureFeatures(TransformerMixin, BaseEstimator):

    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        X_type_token = list()
        X_count_nouns = list()

        for sentence in x_dataset:

            # takes raw text and calculates type token ratio
            X_type_token.append(type_token_ratio(sentence))

            # takes pos tag text and counts number of noun pos tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))

        X = np.array([X_type_token, X_count_nouns]).T

        # debug output
        print(X)
        print(X.shape)

        # the scaler is fitted on the first batch passed to transform()
        if not hasattr(self, 'scaler'):
            self.scaler = StandardScaler().fit(X)
        return self.scaler.transform(X)

This feature transformer then needs to take either the tagged text, for the count_nouns() function, or the raw text, for type_token_ratio().
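
The two helper functions are not shown in the question. As a rough sketch of what they might compute, assuming whitespace-separated tokens and Penn-Treebank-style noun tags that start with "NN" (both are assumptions, not the asker's actual code):

```python
def type_token_ratio(sentence):
    """Ratio of distinct tokens to total tokens in raw, whitespace-split text."""
    tokens = sentence.lower().split()
    # Guard against empty input to avoid division by zero.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def count_nouns(tagged_sentence):
    """Number of tags that start with 'NN' (NN, NNS, NNP, NNPS)."""
    return sum(tag.startswith("NN") for tag in tagged_sentence.split())


print(type_token_ratio("I am a sentence"))  # raw text goes here
print(count_nouns("PP AUX DET NN"))         # tagged text goes here
```

The sketch makes the mismatch visible: one helper expects raw text and the other expects tag sequences, so a single string per sample cannot serve both.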

Solution

I think you need a FeatureUnion over two transformers (TfidfTransformer and POSTransformer). Of course, you need to define POSTransformer yourself.
Maybe this article will help you.
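
A POSTransformer could be sketched roughly as follows. The lookup-table tagger `toy_tag` and its lexicon are hypothetical stand-ins so the example runs without any tagger data. In practice you would plug in your own `do_tagging` logic or a real tagger.

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical toy lexicon standing in for a real POS tagger.
TOY_LEXICON = {
    "i": "PP", "am": "AUX", "a": "DET", "an": "DET",
    "another": "DET", "sentence": "NN", "example": "NN",
}


def toy_tag(sentence):
    """Map each whitespace token to a tag; unknown tokens become 'UNK'."""
    return " ".join(TOY_LEXICON.get(tok.lower(), "UNK") for tok in sentence.split())


class POSTransformer(TransformerMixin, BaseEstimator):
    """Stateless transformer: raw sentences in, tag strings out."""

    def __init__(self, tagger=toy_tag):
        self.tagger = tagger

    def fit(self, X, y=None):
        # Nothing to learn; tagging does not depend on the training data.
        return self

    def transform(self, X):
        return [self.tagger(doc) for doc in X]


tagged = POSTransformer().fit_transform(['I am a sentence', 'an example'])
print(tagged)  # ['PP AUX DET NN', 'DET NN']
```

Because the transformer returns plain strings, a CountVectorizer or TfidfVectorizer can follow it directly in a sub-pipeline.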

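Your pipeline might look like this:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# POSTransformer and MeasureFeatures are user-defined transformers, not part of sklearn.
pipeline = Pipeline([
  ('features', FeatureUnion([
    # word n-grams computed from the raw text
    ('ngram_tf_idf', Pipeline([
      ('counts_ngram', CountVectorizer()),
      ('tf_idf_ngram', TfidfTransformer())
    ])),
    # the raw text is tagged first, then the tag sequences are vectorized
    ('pos_tf_idf', Pipeline([
      ('pos', POSTransformer()),
      ('counts_pos', CountVectorizer()),
      ('tf_idf_pos', TfidfTransformer())
    ])),
    ('measure_features', MeasureFeatures())
  ])),
  ('classifier', LinearSVC())
])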

This assumes that MeasureFeatures and POSTransformer are transformers that conform to the sklearn API.
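
If the text is already tagged outside the pipeline, as with `X_tagged` in the question, another option is to pass each sample as a `(raw, tagged)` pair and let every branch pick the view it needs. The following is only a sketch of that idea: `select_raw`, `select_tagged`, and `PairMeasureFeatures` are hypothetical names, and the feature computations inside are placeholders for the asker's own helpers.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def select_raw(pairs):
    """Keep only the raw text of each (raw, tagged) pair."""
    return [raw for raw, _ in pairs]


def select_tagged(pairs):
    """Keep only the tag string of each (raw, tagged) pair."""
    return [tagged for _, tagged in pairs]


class PairMeasureFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical variant of MeasureFeatures that sees both views."""

    def fit(self, pairs, y=None):
        return self

    def transform(self, pairs):
        rows = []
        for raw, tagged in pairs:
            tokens = raw.lower().split()
            ttr = len(set(tokens)) / len(tokens) if tokens else 0.0   # raw text
            nouns = sum(t.startswith("NN") for t in tagged.split())   # POS text
            rows.append([ttr, nouns])
        return np.array(rows, dtype=float)


X = ['I am a sentence', 'an example']
X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = ['DET NN']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram', Pipeline([
            ('select', FunctionTransformer(select_raw)),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ])),
        ('pos_ngram', Pipeline([
            ('select', FunctionTransformer(select_tagged)),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=2000)),
        ])),
        ('measure_features', PairMeasureFeatures()),
    ])),
    ('clf', LinearSVC(C=0.1)),
])

# Each sample carries both views, so a single fit/predict call is enough.
pipeline.fit(list(zip(X, X_tagged)), Y)
y_pred = pipeline.predict(list(zip(X_dev, X_dev_tagged)))
print(y_pred)
```

Compared with tagging inside the pipeline, this keeps the tagging step out of cross-validation loops, at the cost of having to build the pairs before every `fit` and `predict` call.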
