Create a matrix of words occurring in a Pandas data frame with text strings


Problem description

I have a Pandas data frame with a column of text data. I want to compare each row of this text data with a list of words that I'm interested in. The comparison should result in a matrix that shows the occurrence of the word (0 or 1) in the text of that row of data.

Input data frame:

text
That bear talks
The stone rocks
Tea is boiling
The bear drinks tea

Input word list:

[bear, talks, tea]

Result:

text                 bear  talks  tea
That bear talks      1     1      0
The stone rocks      0     0      0
Tea is boiling       0     0      1
The bear drinks tea  1     0      1
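One way to get exactly this matrix is plain pandas, no sklearn at all. This is a sketch, not from the original answer: it assumes the column is named `text` as in the sample, and uses case-insensitive whole-word matching so that "Tea" counts as "tea" (which the expected result above implies).

```python
import pandas as pd

df = pd.DataFrame({"text": ["That bear talks",
                            "The stone rocks",
                            "Tea is boiling",
                            "The bear drinks tea"]})
words = ["bear", "talks", "tea"]

# One 0/1 column per word. \b anchors restrict matches to whole words,
# case=False makes "Tea" match "tea". Words containing regex special
# characters would need re.escape() first.
for w in words:
    df[w] = df["text"].str.contains(rf"\b{w}\b", case=False).astype(int)

print(df)
```

This scans the text column once per word, which can be slow for millions of rows and many words; the vectorizer route discussed below is the bulk alternative.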

I found some information on sklearn.feature_extraction.text.HashingVectorizer, but from what I understand it just takes the whole data frame, breaks it down into component words, and counts those. What I want is to do this on a very limited list.

With sklearn I did the following:

from sklearn.feature_extraction.text import HashingVectorizer

countvec = HashingVectorizer()
countvec.fit_transform(resultNLdf2.text)

But this gives me the following:

<73319x1048576 sparse matrix of type '<class 'numpy.float64'>'
    with 1105683 stored elements in Compressed Sparse Row format>

Which seems a bit big to work with, unless I could select the words I want from this sparse matrix, but I don't know how to work with it.
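For context (my illustration, not from the question or answer): the object printed above is a SciPy CSR sparse matrix, and the 1048576 columns come from HashingVectorizer's default of n_features=2**20. A small demo shows why selecting specific words from it is hard: the vectorizer hashes each token to a column index, and the hash is one-way, so columns cannot be mapped back to words.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["That bear talks", "The bear drinks tea"]

# A tiny n_features keeps the demo readable (the default is 2**20,
# hence the 1048576 columns in the repr above). norm=None and
# alternate_sign=False make the entries plain token counts.
hv = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
X = hv.fit_transform(docs)

print(X.toarray())  # rows are documents, columns are anonymous hash buckets
```

Because the columns are anonymous hash buckets, there is no way to ask "which column is 'bear'?" after the fact, which is exactly the obstacle described above.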

I'm sorry if I used the wrong words to explain this problem; I'm not sure you would call this a matrix, for example.

Edit

The true data I'm working on is rather large: 1264555 rows with strings of tweets. At least I've learned not to oversimplify a problem :-p. This makes some of the given solutions (thanks for trying to help!!) not work, because of memory issues or because they are extremely slow. This was also a reason I was looking at sklearn.

With:

from sklearn.feature_extraction.text import CountVectorizer

words = ['bear', 'talks', 'tea']

countvec = CountVectorizer(vocabulary=words)
countvec.fit_transform(resultNLdf2.text)

you can actually limit the words you want to look at by giving it a simple list. But this leaves me with the result in a format I'm not sure what to do with, as described above.

Answer

You can use …
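The original answer is cut off in this copy. As a hedged sketch consistent with the `CountVectorizer(vocabulary=...)` route already found in the edit above, one way to turn the sparse result into the desired 0/1 data frame looks like this (the sample data stands in for the asker's `resultNLdf2`; `binary=True` is my addition to clip counts to 0/1):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"text": ["That bear talks",
                            "The stone rocks",
                            "Tea is boiling",
                            "The bear drinks tea"]})
words = ["bear", "talks", "tea"]

# vocabulary= restricts counting to the given words; binary=True
# records presence (0/1) instead of counts; lowercase=True (the
# default) makes "Tea" count as "tea".
countvec = CountVectorizer(vocabulary=words, binary=True)
occurrence = countvec.fit_transform(df["text"])

# occurrence stays a memory-friendly sparse matrix even for millions
# of rows; densify only if the words-by-rows table fits in memory.
result = pd.DataFrame(occurrence.toarray(), columns=words, index=df["text"])
print(result)
```

For the 1264555-row case, it may be better to keep working with the sparse `occurrence` matrix directly (e.g. slice columns by the word's position in `words`) rather than calling `toarray()` on the full data.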
