Create a frequency matrix for bigrams from a list of tuples, using numpy or pandas
Question
I am very new to Python. I have a list of tuples, which I created as bigrams.
This question is very close to what I need.
my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]
Now I am trying to convert this into a frequency matrix.
The desired output is:
          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0
How can I do this using numpy or pandas? Unfortunately, I can only find approaches that use nltk.
Recommended answer
You can create a frequency data frame and address its cells by word:
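import pandas as pd
# Label both axes with every distinct word, in sorted order.
words = sorted(set(item for t in my_list for item in t))
df = pd.DataFrame(0, columns=words, index=words)
# Row is the first word of each bigram, column is the second.
for i in my_list:
    df.at[i[0], i[1]] += 1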
Output:
          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0
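The same ordered matrix can probably also be built without an explicit loop. The following is a minimal sketch, assuming pandas' crosstab is acceptable for this task; the names first, second and freq are illustrative and not part of the original answer.

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Split each bigram into its first and second word.
first = pd.Series([a for a, _ in my_list], name="first")
second = pd.Series([b for _, b in my_list], name="second")

# Every word that appears anywhere, sorted, so the matrix is square.
words = sorted({w for pair in my_list for w in pair})

# crosstab counts each (first, second) pair; reindex adds the rows and
# columns that crosstab leaves out and fills them with 0.
freq = pd.crosstab(first, second).reindex(index=words, columns=words, fill_value=0)
print(freq)
```

Without the reindex step, crosstab would only keep words that actually occur in the first position as rows and in the second position as columns, so the result would not be square.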
Note that in this version the order of the words within each bigram matters. If you don't care about order, sort the contents of each tuple first:
my_list = [tuple(sorted(i)) for i in my_list]
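To see what that sorting step does, here is a small sketch. The second pair, ('consider', 'we'), is a hypothetical reversed bigram added only for illustration; it does not appear in the question's data.

```python
# Two bigrams with the same words in opposite order (the second is hypothetical).
pairs = [('we', 'consider'), ('consider', 'we')]

# Sorting the two words alphabetically gives both tuples the same form,
# so they end up in the same cell of the matrix.
unordered = [tuple(sorted(p)) for p in pairs]
print(unordered)  # [('consider', 'we'), ('consider', 'we')]
```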
Another way is to use Counter to do the counting, although I expect similar performance (again, if the order within bigrams matters, remove sorted from frequency_list):
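import pandas as pd
from collections import Counter
# Count each bigram; sorted() makes the count order-insensitive.
frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words = sorted(set(item for t in my_list for item in t))
df = pd.DataFrame(0, columns=words, index=words)
# Write each count into the cell for its (first, second) pair.
for k, v in frequency_list.items():
    df.at[k[0], k[1]] = v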
Output:
          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0