Text classifier with bag of words and an additional sentiment feature in sklearn


Question

I am trying to build a classifier that, in addition to a bag of words, uses features such as the sentiment or a topic (LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of an LDA analysis (a string with the topic of the sentence).

I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB (multinomial naive Bayes).

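import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# data, names and train_ratio are defined elsewhere
df = pd.DataFrame.from_records(data=data, columns=names)
train, test = train_test_split(
    df,
    train_size=train_ratio,
    random_state=1337
)
train_df = pd.DataFrame(train, columns=names)
test_df = pd.DataFrame(test, columns=names)

# Bag of words: fit the vocabulary on the training texts only
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_df['text'])
test_matrix = vectorizer.transform(test_df['text'])

# Binary target: True where the label is 'decision'
positive_cases_train = (train_df['label'] == 'decision')
positive_cases_test = (test_df['label'] == 'decision')

classifier = MultinomialNB()
classifier.fit(train_matrix, positive_cases_train)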

My question is: how can I introduce the other features to my classifier in addition to the bag of words?

Thanks in advance. If you need more information, I am glad to provide it.

Edit: after adding the rows as suggested by @Guiem, a new question came up about the weight of the new feature. This edit adds that question:

The shape of my train matrix is (2554, 5286). The weird thing is that it has this shape both with and without the sentiment column added (maybe the row is not added properly?).

If I print the matrix, I get the following output:

  (0, 322)  0.0917594575712
  (0, 544)  0.196910480455
  (0, 556)  0.235630958238
  (0, 706)  0.137241420774
  (0, 1080) 0.211125349374
  (0, 1404) 0.216326271935
  (0, 1412) 0.191757369869
  (0, 2175) 0.128800602511
  (0, 2176) 0.271268708356
  (0, 2371) 0.123979845513
  (0, 2523) 0.406583720526
  (0, 3328) 0.278476810585
  (0, 3752) 0.203741786877
  (0, 3847) 0.301505063552
  (0, 4098) 0.213653538407
  (0, 4664) 0.0753937554096
  (0, 4676) 0.164498844366
  (0, 4738) 0.0844966331512
  (0, 4814) 0.251572721805
  (0, 5013) 0.201686066537
  (0, 5128) 0.21174469759
  (0, 5135) 0.187485844479
  (1, 291)  0.227264696182
  (1, 322)  0.0718526940442
  (1, 398)  0.118905396285
  : :
  (2553, 3165)  0.0985290985889
  (2553, 3172)  0.134514497354
  (2553, 3217)  0.0716087169489
  (2553, 3241)  0.172404983302
  (2553, 3342)  0.145912701013
  (2553, 3498)  0.149172538211
  (2553, 3772)  0.140598133976
  (2553, 4308)  0.0704700896603
  (2553, 4323)  0.0800039075449
  (2553, 4505)  0.163830579067
  (2553, 4663)  0.0513678549359
  (2553, 4664)  0.0681930862174
  (2553, 4738)  0.114639856277
  (2553, 4855)  0.140598133976
  (2553, 4942)  0.138370066422
  (2553, 4967)  0.143088901589
  (2553, 5001)  0.185244190321
  (2553, 5008)  0.0876615764151
  (2553, 5010)  0.108531807984
  (2553, 5053)  0.136354534152
  (2553, 5104)  0.0928665728295
  (2553, 5148)  0.171292088292
  (2553, 5152)  0.172404983302
  (2553, 5191)  0.104762377866
  (2553, 5265)  0.123712025565

I hope that helps a little. Or did you want some other information?

Recommended answer

One option would be to simply add these two new features to your CountVectorizer matrix as columns.

As you are not performing any tf-idf, your count matrix is going to be filled with integers, so you could encode your new columns as int values.

You might have to try several encodings, but you can start with something like:

  • Sentiment: transform [-5,...,5] to [0,...,10].
  • String with the topic of the sentence: assign an integer to each distinct topic ({'unicorns':0, 'batman':1, ...}). You can keep a dictionary to assign the integers so that a repeated topic always gets the same value.
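The two encodings above could be sketched as follows. This is only an illustration on made-up data: the DataFrame and its column names `sentiment` and `topic` are assumptions, not part of the original question.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sentiment": [-5, 0, 3, 5],
    "topic": ["unicorns", "batman", "unicorns", "robots"],
})

# Shift sentiment from [-5, 5] to [0, 10] so every value is non-negative.
sentiment_encoded = (df["sentiment"] + 5).astype(int)

# Build the topic -> integer dictionary in order of first appearance,
# so a topic that shows up again reuses the integer it already has.
topic_to_id = {}
for topic in df["topic"]:
    if topic not in topic_to_id:
        topic_to_id[topic] = len(topic_to_id)

topic_encoded = df["topic"].map(topic_to_id)
```

Keeping `topic_to_id` around matters at prediction time: the test set has to be mapped with the same dictionary that was built on the training set.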

And just in case you don't know how to add columns to your train_matrix:

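import numpy as np

dense_matrix = train_matrix.todense()  # CountVectorizer returns a sparse matrix
# np.insert returns a new matrix instead of modifying its input, so keep the result
dense_matrix = np.insert(dense_matrix, dense_matrix.shape[1], [val1,...,valN], axis=1)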

Note that the column [val1,...,valN] needs to have the same length as the number of samples you are using.
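As an alternative to converting to a dense matrix, the extra column can be appended while everything stays sparse, using `scipy.sparse.hstack`. This is a different technique from the `np.insert` call above, shown here as a minimal sketch on made-up texts and sentiment values. Like `np.insert`, `hstack` returns a new matrix, so the result has to be assigned to a name; otherwise the shape appears unchanged.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for train_df['text'] and the sentiment column.
texts = ["we decide today", "nothing to see", "decide later maybe"]
sentiment = np.array([7, 5, 2])  # assumed to be already shifted to [0, 10]

train_matrix = CountVectorizer().fit_transform(texts)

# One value per sample, reshaped into a single sparse column.
extra_column = csr_matrix(sentiment.reshape(-1, 1))

# hstack builds a NEW matrix; train_matrix itself is left untouched.
combined = hstack([train_matrix, extra_column]).tocsr()

print(train_matrix.shape, combined.shape)
```

The combined matrix has exactly one more column than the original, which is a quick way to check that the feature was really added.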

Even though it won't strictly be a bag of words anymore (because not all columns represent word frequencies), adding these two columns brings in the extra information you want to include. And since a naive Bayes classifier treats each feature as contributing independently to the probability, we are okay here.
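A minimal end-to-end sketch of this idea, on made-up data: word counts plus one shifted sentiment column, fed to `MultinomialNB`. The texts, labels, and sentiment values are invented for illustration. The shift to [0, 10] matters here because `MultinomialNB` expects non-negative feature values.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data.
texts = ["we decide today", "nothing happened", "we decide tomorrow", "nothing new"]
labels = np.array([True, False, True, False])
sentiment = np.array([-3, 1, -4, 2])  # raw values in [-5, 5]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# Shift sentiment to [0, 10] and append it as one extra column.
shifted = csr_matrix((sentiment + 5).reshape(-1, 1))
features = hstack([counts, shifted]).tocsr()

classifier = MultinomialNB()
classifier.fit(features, labels)

# A new sample needs the same layout: word counts first, then the sentiment column.
new_counts = vectorizer.transform(["we decide now"])
new_features = hstack([new_counts, csr_matrix([[2]])]).tocsr()
prediction = classifier.predict(new_features)
```

Because the sentiment column can hold values up to 10 while most word counts are 0 or 1, it may weigh more heavily than any single word; trying a coarser encoding is one way to explore that.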

Update: it is better to use a one-hot encoder for categorical features (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). This prevents the weird behavior that arbitrary integer values can cause, since integers imply an order and a distance between categories that do not exist. (You can probably still keep the integer encoding for sentiment, because on a scale from 0 to 10 you do assume that a sample with sentiment 9 is closer to one with sentiment 10 than to one with sentiment 0.) For categorical features, though, one-hot encoding is the better choice. Say you have 3 topics: you can use the same technique of adding columns, only now you add 3 columns instead of one, [topic1, topic2, topic3]. A sample that belongs to topic1 is encoded as [1, 0, 0], and one that belongs to topic3 as [0, 0, 1] (you mark with 1 the column that corresponds to the topic).
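The three-topic example can be reproduced with `OneHotEncoder`. This is a small sketch on invented topic labels; `handle_unknown="ignore"` is an optional choice made here so that a topic unseen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical topic labels, one per sample, shaped as a single column.
topics = np.array(["topic1", "topic3", "topic2", "topic1"]).reshape(-1, 1)

encoder = OneHotEncoder(handle_unknown="ignore")
topic_columns = encoder.fit_transform(topics)  # sparse matrix by default

# Dense view for inspection; columns follow the sorted category order.
one_hot = topic_columns.toarray()
print(encoder.categories_[0])
print(one_hot)

# A topic that was not seen during fit becomes a row of zeros.
unseen = encoder.transform(np.array([["topic9"]])).toarray()
```

Since the encoder output is sparse, it can be appended to the count matrix directly, in the same way as any other extra column.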
