如何在当前单词分类中添加另一个功能(文本长度)? Scikit学习 [英] How to add another feature (length of text) to current bag of words classification? Scikit-learn

查看：55 发布时间：2020/5/4 8:56:59 python machine-learning scikit-learn classification text-classification

本文介绍了如何在当前单词分类中添加另一个功能(文本长度)? Scikit学习的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在用一堆单词对文本进行分类.它运行良好，但我想知道如何添加一个单词所不能提供的功能.

I am using bag of words to classify text. It's working well but I am wondering how to add a feature which is not a word.

这是我的示例代码.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

现在很明显，有关伦敦的文字要比有关纽约的文字长得多.如何将文本长度添加为特征? 我是否必须使用另一种分类方式，然后将两个预测结合起来?有什么办法和这句话一起做吗? 一些示例代码将是很棒的-我对机器学习和scikit学习是非常陌生的.

Now it is clear that the text about London tends to be much longer than the text about New York. How would I add length of the text as a feature? Do I have to use another way of classification and then combine the two predictions? Is there any way of doing it along with the bag of words? Some sample code would be great -- I'm very new to machine learning and scikit learn.

推荐答案

如注释所示，这是FunctionTransformer，FeaturePipeline和FeatureUnion的组合.

As shown in the comments, this is a combination of a FunctionTransformer, a FeaturePipeline and a FeatureUnion.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']


def get_text_length(x):
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
predicted

这会将文本的长度添加到分类器使用的功能中.

This will add the length of the text to the features used by the classifier.

这篇关于如何在当前单词分类中添加另一个功能(文本长度)? Scikit学习的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何在当前单词分类中添加另一个功能(文本长度)? Scikit学习 [英] How to add another feature (length of text) to current bag of words classification? Scikit-learn

问题描述

推荐答案

相关文章

AI人工智能最新文章

热门教程

热门工具

登录关闭

如何在当前单词分类中添加另一个功能(文本长度)? Scikit学习 [英] How to add another feature (length of text) to current bag of words classification? Scikit-learn

问题描述

推荐答案

相关文章

AI人工智能最新文章

热门教程

热门工具

登录 关闭

登录关闭