Create a matrix of words occurring in a Pandas data frame with text strings
Problem description
I have a Pandas data frame with a column of text data. I want to compare each row of this text data against a list of words I'm interested in. The comparison should produce a matrix that shows, for each row, whether each word occurs (0 or 1) in that row's text.
Input data frame:
text
That bear talks
The stone rocks
Tea is boiling
The bear drinks tea
Input word list:
[bear, talks, tea]
Result:
text                 bear  talks  tea
That bear talks      1     1      0
The stone rocks      0     0      0
Tea is boiling       0     0      1
The bear drinks tea  1     0     1
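For the small example above, the desired matrix can be sketched with plain pandas, one indicator column per word (a sketch, assuming whole-word matching is wanted; the `\b` word boundaries keep 'tea' from matching 'team'):

```python
import pandas as pd

df = pd.DataFrame({'text': ['That bear talks', 'The stone rocks',
                            'Tea is boiling', 'The bear drinks tea']})
words = ['bear', 'talks', 'tea']

# One 0/1 indicator column per word of interest; case-insensitive,
# whole-word regex match so substrings don't count.
for w in words:
    df[w] = df['text'].str.contains(rf'\b{w}\b', case=False,
                                    regex=True).astype(int)

print(df)
```

This loops once per word rather than once per row, so it is fine for a short word list, though on millions of rows a vectorizer (below) is typically faster.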
I found some information on sklearn.feature_extraction.text.HashingVectorizer, but from what I understand it takes the whole data frame, breaks it down into its component words, and counts those. What I want is to do this for a very limited list of words.
With sklearn I did the following:
from sklearn.feature_extraction.text import HashingVectorizer
countvec = HashingVectorizer()
countvec.fit_transform(resultNLdf2.text)
But this gave me the following:
<73319x1048576 sparse matrix of type '<class 'numpy.float64'>'
with 1105683 stored elements in Compressed Sparse Row format>
This seems a bit too big to work with, unless I can select just the words I want from this sparse matrix, but I don't know how to do that.
I'm sorry if I used the wrong words to explain this problem; I'm not sure if you would call this a matrix, for example.
Edit
The true data I'm working on is rather large: 1,264,555 rows of tweet strings. At least I've learned not to oversimplify a problem :-p. This makes some of the given solutions (thanks for trying to help!!) unusable, either because of memory issues or because they are extremely slow. This was also a reason I was looking at sklearn.
With:
from sklearn.feature_extraction.text import CountVectorizer
words = ['bear', 'talks', 'tea']
countvec = CountVectorizer(vocabulary=words)
countvec.fit_transform(resultNLdf2.text)
you can actually limit the words to look at by supplying a simple list. But this still leaves me with the problem that the result is in a format I'm not sure what to do with, as described above.
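One way to make that CountVectorizer output usable is to wrap it in a DataFrame: when a vocabulary list is supplied, the sparse matrix columns come out in the same order as that list (a sketch; `binary=True` is an assumption here, to get 0/1 occurrence flags instead of raw counts as in the desired result):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = pd.Series(['That bear talks', 'The stone rocks',
                   'Tea is boiling', 'The bear drinks tea'])
words = ['bear', 'talks', 'tea']

# binary=True yields 0/1 occurrence flags rather than term counts.
countvec = CountVectorizer(vocabulary=words, binary=True)
matrix = countvec.fit_transform(texts)  # scipy CSR sparse matrix

# Columns follow the order of the supplied vocabulary list.
result = pd.DataFrame(matrix.toarray(), columns=words, index=texts)
print(result)
```

With only a handful of vocabulary words, `toarray()` stays small even for a million rows (rows × len(words) integers); for a much larger vocabulary, `pd.DataFrame.sparse.from_spmatrix(matrix, columns=words)` keeps the result sparse.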