Perform feature selection using pipeline and gridsearch


Problem description

As part of a research project, I want to select the best combination of preprocessing techniques and textual features that optimize the results of a text classification task. For this, I am using Python 3.6.

There are a number of methods to combine features and algorithms, but I want to take full advantage of sklearn's pipelines and test all the different (valid) possibilities using grid search for the ultimate feature combo.

My first step was to build a pipeline that looks like the following:

# Run a vectorizer with a predefined tweet tokenizer and a Naive Bayes

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer = tweet_tokenizer)),
    ('nb', MultinomialNB())
])

parameters = {
'vectorizer__preprocessor': (None, preprocessor)
}

gs =  GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1, verbose=1)

In this simple example, the vectorizer tokenizes the data using tweet_tokenizer, and the grid search tests which preprocessing option (None or the predefined function) works better.
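For reference, fitting this grid search is then just (tweets and labels are placeholder names for the raw texts and their classes):

gs.fit(tweets, labels)       # tweets: list of raw strings, labels: their classes
print(gs.best_params_)       # shows whether the custom preprocessor helped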

This seems like a decent start, but I am now struggling to find a way to test all the different possibilities within the preprocessor function, defined below:

def preprocessor(tweet):
    # Data cleaning
    tweet = URL_remover(tweet) # Removing URLs
    tweet = mentions_remover(tweet) # Removing mentions
    tweet = email_remover(tweet) # Removing emails
    tweet = irrelev_chars_remover(tweet) # Removing invalid chars
    tweet = emojies_converter(tweet) # Translating emojies
    tweet = to_lowercase(tweet) # Converting words to lowercase
    # Others
    tweet = hashtag_decomposer(tweet) # Hashtag decomposition
    # Punctuation may only be removed after hashtag decomposition  
    # because it considers "#" as punctuation
    tweet = punct_remover(tweet) # Punctuation 
    return tweet

A "simple" solution to combine all the different processing techniques would be to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.) and set the grid parameter as follows:

parameters = {
   'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...)
}

Although this would most likely work, this isn't a viable or reasonable solution for this task, especially since there are 2^n_features different combinations and, consequently, functions.
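Just to illustrate the blow-up, a throwaway sketch using the eight cleaning helpers from preprocessor() above:

from itertools import combinations

# The eight cleaning helpers from preprocessor() above
steps = [URL_remover, mentions_remover, email_remover, irrelev_chars_remover,
         emojies_converter, to_lowercase, hashtag_decomposer, punct_remover]

# One dedicated function per subset of steps would mean 2**8 = 256 functions
subsets = [c for r in range(len(steps) + 1) for c in combinations(steps, r)]
print(len(subsets))  # 256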

The ultimate goal is to combine both preprocessing techniques and features in a pipeline in order to optimize the results of the classification using gridsearch:

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer = tweet_tokenizer)),
    ('feat_extractor', feat_extractor),
    ('nb', MultinomialNB())
])

parameters = {
   'vectorizer__preprocessor': (None, funcA, funcB, funcC, ...),
   'feat_extractor': (None, func_A, func_B, func_C, ...)
}

Is there a simpler way to obtain this?

Recommended answer

This solution is quite rough, based on your description, and the specifics depend on the type of data used. Before building the pipeline, let's understand how CountVectorizer works on the raw_documents passed to it. Essentially, this is the line that processes the string documents into tokens:

return lambda doc: self._word_ngrams(tokenize(preprocess(self.decode(doc))), stop_words)

which are then just counted and converted to count matrix.
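If it helps to see this in action, that analyzer callable can be inspected directly (a small aside using the public build_analyzer() method):

from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns exactly the lambda above:
# decode -> preprocess -> tokenize -> stop words / n-grams
analyzer = CountVectorizer().build_analyzer()
print(analyzer("Check out https://example.com, it's GREAT!!"))
# ['check', 'out', 'https', 'example', 'com', 'it', 'great']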

So what happens here is:

  1. decode: Just decides how to read the data from a file (if specified). Not of use to us, since we already have the data in a list.
  2. preprocess: Does the following if 'strip_accents' and 'lowercase' are True in CountVectorizer, else nothing:

strip_accents(x.lower())

Again, no use, because we are moving the lowercase functionality to our own preprocessor and don't need to strip accents, since we already have the data as a list of strings.

  3. tokenize: Removes all punctuation, retains only alphanumeric words of length 2 or more, and returns a list of tokens for a single document (element of the list):

lambda doc: token_pattern.findall(doc)

This should be kept in mind. If you want to handle the punctuation and other symbols yourself (deciding which to keep and which to remove), then you should also change the default token_pattern='(?u)\b\w\w+\b' of CountVectorizer.
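For instance (just one possible choice, adjust to your data), a pattern that splits only on whitespace leaves all punctuation handling to your own preprocessor:

# Any run of non-whitespace characters becomes a token, so punctuation survives
vect = CountVectorizer(token_pattern=r"(?u)\S+")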

  4. _word_ngrams: This method first removes the stop words (supplied as a parameter above) from the token list produced in the previous step and then computes the n-grams defined by the ngram_range parameter of CountVectorizer. This should also be kept in mind if you want to handle the "n_grams" your way.

Note: If the analyzer is set to 'char', the tokenizer step is not performed and the n-grams will be made from characters.
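For example (illustrative settings only):

# Word unigrams + bigrams, with English stop words removed in _word_ngrams
word_vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')

# Character n-grams of length 2 to 4; the word tokenizer step is skipped
char_vect = CountVectorizer(analyzer='char', ngram_range=(2, 4))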

Now, coming to our pipeline. This is the structure I think can work here:

X --> combined_pipeline, Pipeline
            |
            |  Raw data is passed to Preprocessor
            |
            \/
         Preprocessor 
                 |
                 |  Cleaned data (still raw texts) is passed to FeatureUnion
                 |
                 \/
              FeatureUnion
                      |
                      |  Data is duplicated and passed to both parts
       _______________|__________________
      |                                  |
      |                                  |                         
      \/                                \/
   CountVectorizer                  FeatureExtractor
           |                                  |   
           |   Converts raw to                |   Extracts numerical features
           |   count-matrix                   |   from raw data
           \/________________________________\/
                             |
                             | FeatureUnion combines both the matrices
                             |
                             \/
                          Classifier

Now coming to code. This is what the pipeline looks like:

# Imports
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# Pipeline
pipe = Pipeline([('preprocessor', CustomPreprocessor()), 
                 ('features', FeatureUnion([("vectorizer", CountVectorizer()),
                                            ("extractor", CustomFeatureExtractor())
                                            ])),
                 ('classifier', SVC())
                ])

where CustomPreprocessor and CustomFeatureExtractor are defined as:

from sklearn.base import TransformerMixin, BaseEstimator

class CustomPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, remove_urls=True, remove_mentions=True, 
                 remove_emails=True, remove_invalid_chars=True, 
                 convert_emojis=True, lowercase=True, 
                 decompose_hashtags=True, remove_punctuations=True):
        self.remove_urls=remove_urls
        self.remove_mentions=remove_mentions
        self.remove_emails=remove_emails
        self.remove_invalid_chars=remove_invalid_chars
        self.convert_emojis=convert_emojis
        self.lowercase=lowercase
        self.decompose_hashtags=decompose_hashtags
        self.remove_punctuations=remove_punctuations

    # You Need to have all the functions ready
    # This method works on single tweets
    def preprocessor(self, tweet):
        # Data cleaning
        if self.remove_urls:
            tweet = URL_remover(tweet) # Removing URLs

        if self.remove_mentions:
            tweet = mentions_remover(tweet) # Removing mentions

        if self.remove_emails:
            tweet = email_remover(tweet) # Removing emails

        if self.remove_invalid_chars:
            tweet = irrelev_chars_remover(tweet) # Removing invalid chars

        if self.convert_emojis:
            tweet = emojies_converter(tweet) # Translating emojies

        if self.lowercase:
            tweet = to_lowercase(tweet) # Converting words to lowercase

        if self.decompose_hashtags:
            # Others
            tweet = hashtag_decomposer(tweet) # Hashtag decomposition

        # Punctuation may only be removed after hashtag decomposition  
        # because it considers "#" as punctuation
        if self.remove_punctuations:
            tweet = punct_remover(tweet) # Punctuation 

        return tweet

    def fit(self, raw_docs, y=None):
        # No-op - we don't learn anything from the data
        return self

    def transform(self, raw_docs):
        return [self.preprocessor(tweet) for tweet in raw_docs]

from textblob import TextBlob
import numpy as np
# Same thing for feature extraction
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, sentiment_analysis=True, tweet_length=True):
        self.sentiment_analysis=sentiment_analysis
        self.tweet_length=tweet_length

    # This method works on single tweets
    def extractor(self, tweet):
        features = []

        if self.sentiment_analysis:
            blob = TextBlob(tweet)
            features.append(blob.sentiment.polarity)

        if self.tweet_length:
            features.append(len(tweet))

        # Do for other features you want.

        return np.array(features)

    def fit(self, raw_docs, y=None):
        # No-op - again, I am assuming we don't learn anything from the data.
        # Definitely not for tweet length, and also not for sentiment analysis
        # or any other feature you might have here.
        return self

    def transform(self, raw_docs):
        # I am returning a numpy array so that the FeatureUnion can handle that correctly
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))
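Before wiring these into the pipeline, a quick standalone sanity check could look like this (hypothetical tweets, and assuming the cleaning helpers from the question are defined):

docs = ["Loving this! #machine_learning", "Worst. Day. Ever."]
cleaned = CustomPreprocessor().transform(docs)        # list of cleaned strings
feats = CustomFeatureExtractor().transform(cleaned)   # array of shape (2, 2): polarity, length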

Finally, the parameter grid can now be defined easily like:

param_grid = {'preprocessor__remove_urls':[True, False],
              'preprocessor__remove_mentions':[True, False],
              ...
              ...
              # No need to search for lowercase or preprocessor in CountVectorizer 
              'features__vectorizer__max_df':[0.1, 0.2, 0.3],
              ...
              ...
              'features__extractor__sentiment_analysis':[True, False],
              'features__extractor__tweet_length':[True, False],
              ...
              ...
              'classifier__C':[0.01, 0.1, 1.0]
             }

The above code avoids having "to create a different function for each possibility (e.g. funcA: proc1, funcB: proc1 + proc2, funcC: proc1 + proc3, etc.)". Just toggle True/False and GridSearchCV will handle the combinations.
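A minimal sketch of wiring it together (tweets and labels are placeholder names for your raw data):

from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, verbose=1)
gs.fit(tweets, labels)
print(gs.best_params_)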

Update: If you don't want to use the CountVectorizer, you can remove it from the pipeline and the parameter grid, and the new pipeline will be:

pipe = Pipeline([('preprocessor', CustomPreprocessor()), 
                 ("extractor", CustomFeatureExtractor()),
                 ('classifier', SVC())
                ])

Then make sure to implement all the functionalities you want in CustomFeatureExtractor. If that becomes too complex, you can always make simpler extractors and combine them together in the FeatureUnion in place of CountVectorizer.
