How to combine TFIDF features with other features


Problem Description

I have a classic NLP problem: I have to classify news articles as fake or real.

I have created two sets of features:

A) Bigram Term Frequency-Inverse Document Frequency

B) Approximately 20 features per document obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as subjectivity of the text, polarity, #stopwords, #verbs, #subject, grammatical relations, etc.

What is the best way to combine the TFIDF features with the other features for a single prediction? Thanks a lot to everyone.

Recommended Answer

I'm not sure whether you're asking technically how to combine the two objects in code, or what to do with them theoretically afterwards, so I will try to answer both.

Technically, your TFIDF is just a matrix where the rows are records and the columns are features. To combine them, you can append your new features as columns to the end of the matrix. If you built this with sklearn, your matrix is probably a sparse matrix (from SciPy), so you will have to make sure your new features are a sparse matrix as well (or make the other one dense).
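A minimal sketch of that appending step, using sklearn's `TfidfVectorizer` and SciPy's `hstack`; the corpus and the extra per-document feature values are invented for illustration:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "breaking news the president resigns",
    "you will not believe this one weird trick",
    "stock markets closed higher on friday",
]

# A) Bigram TFIDF matrix (sparse; rows = documents, columns = bigrams)
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_tfidf = vectorizer.fit_transform(corpus)

# B) Hand-crafted per-document features (dense), e.g. polarity, #stopwords
extra = np.array([
    [0.1, 3],
    [0.9, 5],
    [0.0, 2],
])

# Convert the dense block to sparse and append it column-wise
X = hstack([X_tfidf, csr_matrix(extra)]).tocsr()
print(X.shape)  # 3 rows, n_bigrams + 2 columns
```

`X` can then be passed to any sklearn estimator exactly like the TFIDF matrix alone.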

That gives you your training data. In terms of what to do with it, things are a little trickier. Your features from a bigram frequency matrix will be sparse (I'm not talking about data structures here; I just mean you will have a lot of 0s) and binary, whilst your other data is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering, I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). Usually I found better results when I could at least bin the dense variables into binary (or categorical, and then one-hot encoded into binary) so that they didn't dominate.
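One way to realize the binning idea above is sklearn's `KBinsDiscretizer`, which discretizes continuous features into bins and can one-hot encode them directly; the feature values below are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Dense features, e.g. polarity in [-1, 1] and a stopword count
dense = np.array([
    [-0.8, 12],
    [ 0.1,  3],
    [ 0.9, 25],
    [ 0.4,  7],
])

# encode="onehot" returns a sparse matrix of binary indicator columns,
# so the result can be hstack-ed with the TFIDF matrix directly
binner = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")
binary_features = binner.fit_transform(dense)

print(binary_features.shape)  # (4, 6): 3 bins per feature, one-hot encoded
```

Each row now activates exactly one binary column per original feature, which keeps the engineered features on the same 0/1 scale as the sparse term columns.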

