How to combine TF-IDF features with other features


Question

I have a classic NLP problem: I have to classify news articles as fake or real.

I have created two sets of features:

A) Bigram Term Frequency-Inverse Document Frequency (TF-IDF)

B) Approximately 20 features per document obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as the subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc.

What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone.

Answer

I'm not sure whether you are asking technically how to combine the two objects in code, or what to do with them theoretically afterwards, so I will try to answer both.

Technically, your TF-IDF is just a matrix in which the rows are records and the columns are features. To combine the two sets, you can append your new features as columns at the end of that matrix. If you built it with sklearn, the matrix is probably a sparse matrix (from SciPy), so you will have to make sure your new features are in a sparse matrix as well (or make the TF-IDF matrix dense).
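As a minimal sketch of that column-appending step, assuming a toy corpus and made-up dense features in place of the real pattern.en output, `scipy.sparse.hstack` does the combining:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the news articles (hypothetical data).
docs = [
    "the economy is collapsing according to sources",
    "scientists confirm the study results",
    "shocking secret they do not want you to know",
]

# A) Bigram TF-IDF matrix: rows are documents, columns are bigrams.
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf = vectorizer.fit_transform(docs)  # sparse CSR matrix

# B) Dense hand-crafted features, e.g. [subjectivity, polarity, #stopwords].
extra = np.array([
    [0.8, -0.4, 3],
    [0.2,  0.1, 2],
    [0.9, -0.7, 5],
])

# Convert the dense block to sparse and append its columns
# to the end of the TF-IDF matrix.
X = hstack([tfidf, csr_matrix(extra)]).tocsr()

print(X.shape)  # rows = documents, columns = bigrams + 3 extra features
```

Going the other way (`tfidf.toarray()` plus `np.hstack`) also works, but densifying a large bigram matrix can exhaust memory, so keeping everything sparse is usually the safer choice.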

That gives you your training data; what to do with it is a little trickier. Your features from the bigram frequency matrix will be sparse (I'm not talking about data structures here, just that you will have a lot of 0s) and binary, whereas your other data is dense and continuous. Most machine learning algorithms will run on this as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering, I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). I usually found better results when I could at least bin the dense variables into binary ones (or categorical, then one-hot encoded into binary) so that they didn't dominate.
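The binning idea above can be sketched as follows. This is one possible implementation, not the answerer's exact pipeline: the documents, labels, and dense values are invented, and `KBinsDiscretizer` is used to turn each continuous feature into one-hot bin indicators before feeding everything to a tree ensemble:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical labeled corpus: 1 = fake, 0 = real.
docs = [
    "breaking news the president resigns",
    "study shows coffee is healthy",
    "you will not believe this shocking trick",
    "local council approves new budget",
    "secret cure they are hiding from you",
    "report details quarterly earnings growth",
]
labels = [1, 0, 1, 0, 1, 0]

# Binary term presence, matching the sparse 0/1 picture described above.
term_matrix = CountVectorizer(binary=True).fit_transform(docs)

# Hypothetical dense continuous features, e.g. [subjectivity, polarity].
dense = np.array([
    [0.7, -0.3],
    [0.2,  0.4],
    [0.9, -0.6],
    [0.1,  0.0],
    [0.8, -0.5],
    [0.3,  0.2],
])

# Bin each continuous feature into 3 one-hot-encoded bins so that
# no single dense column can dominate the tree splits.
binner = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform")
binned = binner.fit_transform(dense)  # sparse one-hot matrix

# Combine the binary term features with the binned dense features.
X = hstack([term_matrix, binned]).tocsr()
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(clf.score(X, labels))
```

With real data you would of course evaluate on a held-out set rather than on the training documents; the point here is only that, after binning, every column is a 0/1 indicator on a comparable scale.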

