How to map the words in a data frame to integer IDs with python-pandas and gensim?
Question
Given a data frame like this, with items and their corresponding review texts:
item_id review_text
B2JLCNJF16 i was attracted to this...
B0009VEM4U great snippers...
I want to map the 5000 most frequent words in review_text to integer IDs, so the resulting data frame should look like this:
item_id review_text
B2JLCNJF16 1 2 3 4 5...
B0009VEM4U 6... # the word "snippers" is dropped because it is outside the 5000 most frequent words
Or, preferably, a bag-of-words vector:
item_id review_text
B2JLCNJF16 [1,1,1,1,1....]
B0009VEM4U [0,0,0,0,0,1....]
How can I do that? Thanks a lot!
I have tried @ayhan's answer, and I have now successfully converted the review text to doc2bow form:
item_id review_text
B2JLCNJF16 [(123,2),(130,3),(159,1)...]
B0009VEM4U [(3,2),(110,2),(121,5)...]
This means that the word with ID 123 occurs 2 times in that document. Now I'd like to convert it to a vector like:
[0,0,0,.....,2,0,0,0,....,3,0,0,0,......1...]
#123rd 130th 159th
Do you know how to do that? Thank you in advance!
Recommended answer
First, get a list of words for every row:
df["review_text"] = df["review_text"].map(lambda x: x.split(' '))
Now you can pass df["review_text"] to gensim's Dictionary:
from gensim import corpora
dictionary = corpora.Dictionary(df["review_text"])
To keep only the 5000 most frequent words, use the filter_extremes method:
dictionary.filter_extremes(no_below=1, no_above=1, keep_n=5000)
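To make the effect of keep_n more concrete, here is a minimal standard-library sketch of the same idea: rank tokens by how many documents they appear in, keep the top N, and replace each review by the sequence of surviving IDs (the first output format asked for in the question). The helper names build_vocab and encode and the toy reviews are made up for illustration, and the IDs gensim assigns will generally differ from these.

```python
from collections import Counter

def build_vocab(tokenized_docs, keep_n):
    """Map the keep_n tokens that occur in the most documents to integer IDs."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    # Sort by descending document frequency, then alphabetically to break ties.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token: idx for idx, (token, _) in enumerate(ranked[:keep_n])}

def encode(tokens, vocab):
    """Replace each in-vocabulary token by its ID and drop all other tokens."""
    return [vocab[t] for t in tokens if t in vocab]

# Toy data (hypothetical), already split into words.
docs = [
    "great scissors great price".split(),
    "great snippers".split(),
    "scissors arrived late".split(),
]
vocab = build_vocab(docs, keep_n=2)
encoded = [encode(d, vocab) for d in docs]
print(vocab)
print(encoded)
```

With a pandas data frame the same encode function could be applied through df["review_text"].map, assuming the column already holds lists of words.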
The doc2bow method gives you the bag-of-words representation as (word_id, frequency) pairs:
df["bow"] = df["review_text"].map(dictionary.doc2bow)
0 [(1, 2), (3, 1), (5, 1), (11, 1), (12, 3), (18...
1 [(0, 3), (24, 1), (28, 1), (30, 1), (56, 1), (...
2 [(8, 1), (15, 1), (18, 2), (29, 1), (36, 2), (...
3 [(69, 1), (94, 1), (115, 1), (123, 1), (128, 1...
4 [(2, 1), (18, 4), (26, 1), (32, 1), (55, 1), (...
5 [(6, 1), (18, 1), (30, 1), (61, 1), (71, 1), (...
6 [(0, 5), (13, 1), (18, 6), (31, 1), (42, 1), (...
7 [(0, 10), (5, 1), (18, 1), (35, 1), (43, 1), (...
8 [(0, 24), (1, 4), (4, 2), (7, 1), (10, 1), (14...
9 [(0, 7), (18, 3), (30, 1), (32, 1), (34, 1), (...
10 [(0, 5), (9, 1), (18, 3), (19, 1), (21, 1), (2...
After getting the bag-of-words representation, you can concatenate the per-row series into one data frame (probably not very efficient):
import pandas as pd
df2 = pd.concat([pd.DataFrame(s).set_index(0) for s in df["bow"]], axis=1).fillna(0).T.set_index(df.index)
0 1 2 3 4 5 6 7 8 9 ... 728 729 730 731 732 733 734 735 736 737
0 0 2 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 1 1 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 1 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
6 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
7 10 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
8 24 4 0 0 2 0 0 1 0 0 ... 1 1 2 0 1 3 1 0 1 0
9 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
10 5 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
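If the concat approach turns out to be too slow, the (word_id, frequency) pairs can also be expanded by hand. The following is only a sketch under assumptions: bow_to_dense is a hypothetical helper, the bows list stands in for df["bow"], and num_terms would be len(dictionary) when working with gensim.

```python
def bow_to_dense(bow, num_terms):
    """Expand a list of (word_id, count) pairs into a fixed-length count vector."""
    vec = [0] * num_terms
    for word_id, count in bow:
        vec[word_id] = count
    return vec

# Stand-in for df["bow"]; the last review contains no known words.
bows = [[(1, 2), (3, 1)], [(0, 3)], []]
num_terms = 5  # assumption: with gensim this would be len(dictionary)

dense = [bow_to_dense(b, num_terms) for b in bows]
for row in dense:
    print(row)
```

Every row has the same length regardless of which IDs occur, and an empty review simply becomes an all-zero row. gensim also ships matutils.corpus2dense for this purpose; as far as I know it returns a terms-by-documents array, so the result would need to be transposed to get one row per review.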